
importccl: skip detour through CSV when importing from workload #34892

Merged
craig[bot] merged 2 commits into cockroachdb:master from dt:import-workload-progress
Feb 19, 2019

Conversation

@dt
Contributor

@dt dt commented Feb 14, 2019

Based on some recent suggestions from @danhhz, while I was thinking about ways to fix job progress reporting for workload imports, I looked at how hard it would be to skip our detour through CSV when importing on-the-fly workload-generated data, since that detour is what made progress reporting hard. Specifically, it was harder because the on-the-fly CSV generation doesn't know up front how many bytes it will return. The usual IMPORT progress tracking used by the file-reading frontends is `num_bytes_read/num_bytes`, but that doesn't work if you don't know `num_bytes`.

However, the workload generator itself certainly knows how many rows it has generated and how many it will generate as it goes. If we hook it up directly, without a detour through a "file" of CSV data, we can just report that number for progress directly.
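The row-based progress idea above can be sketched in a few lines. This is a simplified stand-in (the type and field names here are invented, not the actual importccl code): when the generator knows its total row count up front, the completed fraction is just rows emitted so far over total rows, with no byte counts involved.

```go
package main

import "fmt"

// workloadProgress is a hypothetical sketch of row-based progress tracking:
// the generator knows totalRows up front, so no byte counts are needed.
type workloadProgress struct {
	rowsEmitted int64
	totalRows   int64
}

// fraction reports completion as a value in [0, 1].
func (p *workloadProgress) fraction() float64 {
	if p.totalRows == 0 {
		return 0
	}
	return float64(p.rowsEmitted) / float64(p.totalRows)
}

func main() {
	p := workloadProgress{rowsEmitted: 250, totalRows: 1000}
	fmt.Printf("%.2f\n", p.fraction())
}
```

Contrast with the byte-based scheme: a streaming CSV generator cannot supply the denominator (`num_bytes`) at all, so this fraction was previously unreportable.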

That is what got me looking at this, but more importantly, skipping the detour to CSV lets us skip a lot of expensive string encoding and decoding in some cases.

This is just a first pass at this -- mostly to prove it works and solves my original progress concern -- and this initial implementation does not even begin to exploit all the possible optimizations. However, it likely makes testing some of them much easier.

The specialized frontend is used automatically by the reader process when it is configured to read CSV data and all file URIs point to workload generation-backed storage. In that case, it uses the workload configuration values from those file URIs to set up a generator directly and then writes its output to the same row-converter that the CSV reader would have used -- simply cutting the CSV encoder and decoder out of the middle.

This is currently gated behind the COCKROACH_IMPORT_WORKLOAD_FASTER env var.

This is a baby-step towards one of the areas identified in #34809.

Release note: none.

@dt dt requested review from a team, danhhz and madelynnblue February 14, 2019 05:46
@cockroach-teamcity
Member

This change is Reviewable

@dt dt force-pushed the import-workload-progress branch from 3664784 to aba3941 Compare February 14, 2019 13:16
@danhhz danhhz mentioned this pull request Feb 14, 2019
6 tasks
Contributor

@danhhz danhhz left a comment


:lgtm_strong: this is awesome!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)


pkg/ccl/importccl/read_import_workload.go, line 1 at r2 (raw file):

// Copyright 2018 The Cockroach Authors.

nit: 2019


pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):

// this tries to fast-path a few workload-generated things to dodge the parse
// TODO(dt): have workload's Batch func return Datums directly as the fast-path.

agree with the sentiment, but I'd love to stay away from Datums in workload. too many allocations. jordan is pressing pretty hard on figuring out our in-memory columnar data representation (which will be much more allocation healthy), so my current plan is just to wait for that


pkg/ccl/importccl/read_import_workload.go, line 66 at r2 (raw file):

		return tree.NewDInt(tree.DInt(t)), nil
	case string:
		return tree.ParseDatumStringAs(hint, t, evalCtx)

It's probably not hard to figure out the various cases and avoid ParseDatumStringAs. I'd also love to go in here and reuse some datums and I'd love to get this merged for the better progress tracking, so happy to do this myself as a followup
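The "figure out the various cases" suggestion above amounts to type-switching on the generator's native Go values and only falling back to string parsing when necessary. Below is a simplified stdlib stand-in (the real converter returns `tree.Datum` values and falls back to `tree.ParseDatumStringAs`; the names and int-only scope here are illustrative assumptions).

```go
package main

import (
	"fmt"
	"strconv"
)

// convert is a simplified analogue of the datum converter discussed
// above: native Go ints take the fast path with no string parsing,
// while strings fall back to a parse (standing in for ParseDatumStringAs).
func convert(v interface{}) (int64, error) {
	switch t := v.(type) {
	case int:
		return int64(t), nil // fast path: no parsing at all
	case int64:
		return t, nil
	case string:
		return strconv.ParseInt(t, 10, 64) // slow fallback path
	}
	return 0, fmt.Errorf("unhandled type %T", v)
}

func main() {
	a, _ := convert(42)  // fast path
	b, _ := convert("7") // fallback path
	fmt.Println(a, b)
}
```

The profiling note later in this thread suggests why this matters: even matching just the int cases avoided a large share of parse-related CPU in a tpcc profile.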

@dt dt force-pushed the import-workload-progress branch 2 times, most recently from a574aaa to 89eec68 Compare February 16, 2019 16:19
Contributor Author

@dt dt left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz)


pkg/ccl/importccl/read_import_workload.go, line 1 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: 2019

Done.


pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

agree with the sentiment, but I'd love to stay away from Datums in workload. too many allocations. jordan is pressing pretty hard on figuring out our in-memory columnar data representation (which will be much more allocation healthy), so my current plan is just to wait for that

A workload could return its strings tagged with the type aliases, like tree.DString instead of string, without any additional allocs, right?

They wouldn't implement tree.Datum as non-pointers, but they would carry the additional info about what type workload actually wanted, along to whatever is charged with transforming them into KVs (i.e. this func for now). Anyway, happy to revisit in followup.
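The tagged-type idea above can be illustrated with plain named types. This sketch uses invented names (`WorkloadString`, `WorkloadInt`) standing in for aliases like `tree.DString`: converting a `string` to a named string type is a type conversion, not a copy, so the extra type information rides along for free and the converter can type-switch on it.

```go
package main

import "fmt"

// WorkloadString and WorkloadInt are hypothetical named types a
// generator could emit instead of bare string/int. The conversion
// string -> WorkloadString allocates nothing; it only changes the
// static type, which the converter can then switch on.
type WorkloadString string
type WorkloadInt int64

func describe(v interface{}) string {
	switch v.(type) {
	case WorkloadString:
		return "string column"
	case WorkloadInt:
		return "int column"
	}
	return "unknown"
}

func main() {
	fmt.Println(describe(WorkloadString("abc")))
	fmt.Println(describe(WorkloadInt(1)))
}
```

As the comment notes, these named types wouldn't implement tree.Datum themselves; they just let the KV-conversion step know the intended SQL type without a parse.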


pkg/ccl/importccl/read_import_workload.go, line 66 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

It's probably not hard to figure out the various cases and avoid ParseDatumStringAs. I'd also love to go in here and reuse some datums and I'd love to get this merged for the better progress tracking, so happy to do this myself as a followup

Yeah, my plan was to keep it as a fallback but try to identify the hottest paths via profiles of the workloads that we care about and hit those, e.g. just matching the ints appeared to get it from 30% of CPU down to ~15% in one tpcc profile.

dt added 2 commits February 16, 2019 16:23
This is a minor refactor that just inverts the calling pattern of a common helper.
Recognizing that most of the file-reading import frontends needed to do the same
work of opening a file, setting up decompression and progress tracking, etc, we
added a common helper that took one of the frontend's raw file handling func and
called it after doing the common setup.

This simply inverts that -- the frontends get the files to convert, then can (and do)
choose to call the helper internally and pass it their own per-file func. This paves
the way for frontends that might want to do something different.

Release note: none.
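The inversion described in the commit message above can be sketched as follows. All names here are invented for illustration: before, a driver looped over files and invoked each frontend's per-file func through the helper; after, the frontend receives the whole file list and chooses to call the shared helper itself.

```go
package main

import (
	"fmt"
	"strings"
)

// readFileHelper is a stand-in for the common per-file setup the
// commit message describes (opening the file, decompression,
// progress tracking), which then hands content to a per-file func.
func readFileHelper(name string, perFile func(content string)) {
	perFile("contents of " + name)
}

// csvFrontend shows the inverted pattern: the frontend gets the file
// list directly and opts in to the shared helper per file. A frontend
// like the workload reader could instead skip the helper entirely.
func csvFrontend(files []string) []string {
	var out []string
	for _, f := range files {
		readFileHelper(f, func(content string) {
			out = append(out, strings.ToUpper(content))
		})
	}
	return out
}

func main() {
	fmt.Println(csvFrontend([]string{"a.csv"}))
}
```

The payoff of the inversion is exactly the workload case: a frontend that generates rows rather than reading files never needs the file-opening helper at all.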
Based on suggestions from Dan and in an attempt to fix job progress reporting for workload imports
(which previously generated CSV data on the fly for the CSV reader to consume, but did not know how to
estimate a size and thus did not get incremental progress reporting), this is a prototype of using a
dedicated IMPORT frontend for producing workload KV data.

This is just a first pass at this -- mostly to prove the progress reporting would work -- and does not
even begin to exploit all the possible optimizations using a separate frontend would make not only possible
but even pretty easy, e.g. this still uses strings and parsing for non-int types.

Release note: none.
@dt dt force-pushed the import-workload-progress branch from 89eec68 to 4c1a5b1 Compare February 16, 2019 16:23
@dt
Contributor Author

dt commented Feb 19, 2019

I changed up the datum converter a tad last week, so this is ready for another look.

Contributor

@danhhz danhhz left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)


pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):

Previously, dt (David Taylor) wrote…

A workload could return its strings tagged with the type aliases, like tree.DString instead of string, without any additional allocs, right?

They wouldn't implement tree.Datum as non-pointers, but they would carry the additional info about what type workload actually wanted, along to whatever is charged with transforming them into KVs (i.e. this func for now). Anyway, happy to revisit in followup.

Oh, the type aliases are an interesting idea. We're definitely agreed that this would be a followup.

@dt
Contributor Author

dt commented Feb 19, 2019

bors r+

craig bot pushed a commit that referenced this pull request Feb 19, 2019
34892: importccl: skip detour through CSV when importing from workload r=dt a=dt

(Commit message body is identical to the PR description above.)

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig
Contributor

craig bot commented Feb 19, 2019

Build succeeded

@craig craig bot merged commit 4c1a5b1 into cockroachdb:master Feb 19, 2019
@dt dt deleted the import-workload-progress branch February 19, 2019 22:44