importccl: skip detour through CSV when importing from workload #34892
craig[bot] merged 2 commits into cockroachdb:master
Conversation
Force-pushed from 3664784 to aba3941
danhhz
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 1 at r2 (raw file):
// Copyright 2018 The Cockroach Authors.
nit: 2019
pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):
// this tries to fast-path a few workload-generated things to dodge the parse
// TODO(dt): have workload's Batch func return Datums directly as the fast-path.
Agree with the sentiment, but I'd love to stay away from Datums in workload: too many allocations. Jordan is pressing pretty hard on figuring out our in-memory columnar data representation (which will be much more allocation-healthy), so my current plan is just to wait for that.
pkg/ccl/importccl/read_import_workload.go, line 66 at r2 (raw file):
return tree.NewDInt(tree.DInt(t)), nil
case string:
	return tree.ParseDatumStringAs(hint, t, evalCtx)
It's probably not hard to figure out the various cases and avoid ParseDatumStringAs. I'd also love to go in here and reuse some datums, and I'd love to get this merged for the better progress tracking, so I'm happy to do this myself as a followup.
Force-pushed from a574aaa to 89eec68
dt
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz)
pkg/ccl/importccl/read_import_workload.go, line 1 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: 2019
Done.
pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Agree with the sentiment, but I'd love to stay away from Datums in workload: too many allocations. Jordan is pressing pretty hard on figuring out our in-memory columnar data representation (which will be much more allocation-healthy), so my current plan is just to wait for that.
A workload could return its strings tagged with the type aliases, like tree.DString instead of string, without any additional allocs, right?
They wouldn't implement tree.Datum as non-pointers, but they would carry the additional info about what type workload actually wanted along to whatever is charged with transforming them into KVs (i.e. this func for now). Anyway, happy to revisit in a followup.
pkg/ccl/importccl/read_import_workload.go, line 66 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
It's probably not hard to figure out the various cases and avoid ParseDatumStringAs. I'd also love to go in here and reuse some datums, and I'd love to get this merged for the better progress tracking, so I'm happy to do this myself as a followup.
Yeah, my plan was to keep it as a fallback but try to identify the hottest paths via profiles of the workloads that we care about and hit those, e.g. just matching the ints appeared to get it from ~30% of CPU down to ~15% in one tpcc profile.
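The int fast-path discussed here can be sketched roughly as a type switch that handles the common concrete types directly and only falls back to string parsing for everything else. This is a hypothetical, self-contained simplification: `Datum`, `DInt`, `DString`, and `makeDatum` below are stand-ins for the `tree` types and `ParseDatumStringAs`, not the actual importccl code.

```go
package main

import (
	"fmt"
	"strconv"
)

// Datum is a simplified stand-in for tree.Datum.
type Datum interface{ String() string }

type DInt int64

func (d DInt) String() string { return strconv.FormatInt(int64(d), 10) }

type DString string

func (d DString) String() string { return string(d) }

// makeDatum fast-paths the common concrete types produced by workload
// generators (ints, here) and only falls back to string parsing for
// everything else, mirroring the type switch discussed above.
func makeDatum(v interface{}) (Datum, error) {
	switch t := v.(type) {
	case int:
		return DInt(t), nil // hot path: no string round-trip
	case int64:
		return DInt(t), nil
	case string:
		// Fallback: parse the string form (stand-in for ParseDatumStringAs).
		if n, err := strconv.ParseInt(t, 10, 64); err == nil {
			return DInt(n), nil
		}
		return DString(t), nil
	default:
		return nil, fmt.Errorf("unhandled type %T", t)
	}
}

func main() {
	d, _ := makeDatum(42)
	fmt.Println(d) // int fast path, no parsing
	s, _ := makeDatum("hello")
	fmt.Println(s) // string fallback
}
```

The point of the profile numbers above is that the `case int` branch alone covers enough of the hot rows (e.g. tpcc's many integer columns) to roughly halve the conversion cost before any other cases are specialized.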
This is a minor refactor that just inverts the calling pattern of a common helper. Recognizing that most of the file-reading import frontends needed to do the same work of opening a file, setting up decompression and progress tracking, etc., we added a common helper that took one of the frontend's raw file-handling funcs and called it after doing the common setup. This simply inverts that: the frontends get the files to convert, then can (and do) choose to call the helper internally and pass it their own per-file func. This paves the way for frontends that might want to do something different.

Release note: none.
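The inversion this commit describes amounts to something like the following sketch. All names here are hypothetical simplifications, not the importccl ones; the point is only who owns the loop over files.

```go
package main

import "fmt"

// Before the refactor, a common helper owned the loop over files and called
// a per-file func supplied by each frontend. After it, each frontend owns
// the loop and opts in to the helper per file, so a frontend is free to do
// something entirely different with its files.

// readFileWithCommonSetup is the shared per-file helper: open the file,
// set up decompression and progress tracking, then hand the contents to fn.
func readFileWithCommonSetup(name string, fn func(contents string) error) error {
	contents := "<data from " + name + ">" // pretend open/decompress happened
	fmt.Println("progress: reading", name)
	return fn(contents)
}

// csvFrontend now drives the loop itself and chooses to call the helper,
// passing its own per-file conversion func.
func csvFrontend(files []string) error {
	for _, f := range files {
		if err := readFileWithCommonSetup(f, func(c string) error {
			fmt.Println("csv: converting", c)
			return nil
		}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = csvFrontend([]string{"a.csv", "b.csv"})
}
```

With the loop in the frontend's hands, a frontend like the workload reader in the next commit can skip `readFileWithCommonSetup` entirely and interpret its "files" however it likes.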
Based on suggestions from Dan, and in an attempt to fix job progress reporting for workload imports (which previously generated CSV data on the fly for the CSV reader to consume, but did not know how to estimate a size and thus did not get incremental progress reporting), this is a prototype of using a dedicated IMPORT frontend for producing workload KV data. This is just a first pass -- mostly to prove the progress reporting would work -- and does not even begin to exploit all the possible optimizations that using a separate frontend would make not only possible but even pretty easy. E.g. this still uses strings and parsing for non-int types.

Release note: none.
Force-pushed from 89eec68 to 4c1a5b1
I changed up the datum converter a tad last week, so this is ready for another look.
danhhz
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):
Previously, dt (David Taylor) wrote…
A workload could return its strings tagged with the type aliases, like tree.DString instead of string, without any additional allocs, right? They wouldn't implement tree.Datum as non-pointers, but they would carry the additional info about what type workload actually wanted along to whatever is charged with transforming them into KVs (i.e. this func for now). Anyway, happy to revisit in followup.
Oh, the type aliases are an interesting idea. We're definitely agreed that this would be a followup.
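The type-alias idea floated here can be illustrated with a small sketch. `taggedString` below is a hypothetical stand-in for returning `tree.DString` instead of `string`: converting a `string` to a defined string type reuses the same backing data (no copy), which mirrors dt's "no additional allocs" claim for the conversion itself, while a downstream type switch can recover the intended SQL type without a parse hint.

```go
package main

import "fmt"

// taggedString is a defined string type: same in-memory representation as a
// plain string, but distinguishable in a type switch, so it can carry "this
// was meant to be a SQL string" along to whatever builds KVs from it.
type taggedString string

// describe shows how a converter could branch on the tag instead of needing
// an external type hint for every plain string it receives.
func describe(v interface{}) string {
	switch v.(type) {
	case taggedString:
		return "string column (type known, no parse hint needed)"
	case string:
		return "untyped string (needs a parse hint)"
	default:
		return fmt.Sprintf("other: %T", v)
	}
}

func main() {
	fmt.Println(describe(taggedString("a")))
	fmt.Println(describe("a"))
}
```

As dt notes, such values wouldn't implement tree.Datum, so a conversion step would still be needed; the win is only that the intended type travels with the value for free.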
bors r+
34892: importccl: skip detour through CSV when importing from workload r=dt a=dt

Based on some recent suggestions from @danhhz, while I was thinking about ways to fix job progress reporting for workload imports, I looked at how hard it would be to skip our detour through CSV when importing on-the-fly workload-generated data, since that detour is what made progress reporting hard. Specifically, it was harder because the on-the-fly CSV generation doesn't know up front how many bytes it will return. The usual IMPORT progress tracking used by the file-reading frontends is `num_bytes_read/num_bytes`, but that doesn't work if you don't know `num_bytes`. However, the workload generator itself certainly knows how many rows it has generated and how many it will generate as it goes. If we hook it up directly, without a detour through a "file" of CSV data, we can just report that number for progress directly.

That is what got me looking at this, but more importantly, skipping the detour to CSV lets us skip a lot of expensive string encoding and decoding in some cases.

This is just a first pass at this -- mostly to prove it works and solves my original progress concern -- and this initial implementation does not even begin to exploit all the possible optimizations. However, it likely makes testing some of them much easier.

The specialized frontend is automatically used by the reader process when it is configured to read CSV data and all file URIs point to workload-generation-backed storage. In that case, it uses the workload configuration values from those file URIs to set up a generator directly and then writes its output to the same row-converter that the CSV reader would have used -- simply cutting the CSV encoder and decoder out of the middle.

This is currently gated behind the `COCKROACH_IMPORT_WORKLOAD_FASTER` env var.

This is a baby-step towards one of the areas identified in #34809.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
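The progress-reporting fix described above boils down to switching the progress fraction's denominator from bytes (unknown up front for on-the-fly CSV) to rows (known up front by the generator). A hypothetical sketch, not the importccl implementation:

```go
package main

import "fmt"

// progressFraction reports completed work as rowsEmitted/totalRows. A
// workload generator knows totalRows before it starts, unlike the byte-based
// num_bytes_read/num_bytes scheme, which needs the total byte size up front
// and so can't report progress for data generated on the fly.
func progressFraction(rowsEmitted, totalRows int64) float64 {
	if totalRows <= 0 {
		return 0 // no row count available; report no progress rather than divide by zero
	}
	return float64(rowsEmitted) / float64(totalRows)
}

func main() {
	// e.g. 250 of 1000 rows generated so far.
	fmt.Printf("%.2f\n", progressFraction(250, 1000)) // 0.25
}
```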
Build succeeded