importccl: parallelize CSV-skipping workload IMPORT#36060
dt wants to merge 3 commits into cockroachdb:master from
Conversation
This extracts, from the existing logger, the logic for sending a progress update only when progress has meaningfully advanced and for limiting the frequency of updates, allowing it to be reused elsewhere.
Release note: none.
danhhz
left a comment
heh, I have a local PR that does this, which I think has some advantages over what you have here, but your insight about doing them in order is a good one and my impl doesn't do that. Just pushed it to 5f728d9 for reference. Take a look at some of the cleanups I did
Include the two benchmarks I have in my commit message in your commit message, please.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 33 at r3 (raw file):
type workloadReader struct {
	evalCtx *tree.EvalContext
One of the other impls says we need an evalCtx per worker because it's not threadsafe
pkg/ccl/importccl/read_import_workload.go, line 146 at r3 (raw file):
// together and then ingested together in the same SST, minimizing the amount
// of overlapping SSTs.
// TODO(dt): on very long imports, these could drift. We might want to check
Much simpler than this and guaranteed to work is to have each worker use an atomic int to grab batches to process. I think this is worth doing in the initial PR instead of what you have here.
batchIdxAtomic := int64(conf.BatchBegin - 1)
for i := 0; i < workers; i++ {
	g.GoCtx(func(ctx context.Context) error {
		for {
			batch := int(atomic.AddInt64(&batchIdxAtomic, 1))
			if batch >= conf.BatchEnd {
				break
			}
			...
		}
		return nil
	})
}
pkg/ccl/importccl/read_import_workload.go, line 150 at r3 (raw file):
// that we're significantly above it.
for i := 0; i < workers; i++ {
	thread := i
s/thread/goroutine/ everywhere in this PR
pkg/ccl/importccl/read_import_workload.go, line 152 at r3 (raw file):
thread := i
g.GoCtx(func(ctx context.Context) error {
	conv, err := newRowConverter(w.table, w.evalCtx, w.kvCh)
move everything in this body into a method on (*workloadReader)
pkg/ccl/importccl/read_import_workload.go, line 161 at r3 (raw file):
var rowIdx int64
for b := begin + thread; b < end; b += workers {
	// log.Infof(ctx, "%s thread %d of %d importing batch %d", t.Name, thread, workers, b)
intentional?
This spins up multiple workers, each importing every i-th batch, to do workload IMPORT. As noted inline, this execution order, as opposed to assigning large spans of batches to each worker, should mean that adjacent batches are processed at roughly the same time and thus end up in the same sort-batch for SST creation, preserving the non-overlapping SSTs when the workload's batches are ordered and non-overlapping.
Release note: none.
dt
left a comment
Added benchmarks -- they don't look quite the same as yours, but they are similar in shape, and our starting times were pretty different, so I'm guessing it's mostly hardware.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
pkg/ccl/importccl/read_import_workload.go, line 33 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
One of the other impls says we need an evalCtx per worker because it's not threadsafe
Huh, testrace didn't seem to mind, but we only use a couple fields of the ctx (for now), so maybe it just didn't hit the mutation part. Easy enough to make a copy for each and not worry about it.
pkg/ccl/importccl/read_import_workload.go, line 146 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Much simpler than this and guaranteed to work is to have each worker use an atomic int to grab batches to process. I think this is worth doing in the initial PR instead of what you have here.
batchIdxAtomic := int64(conf.BatchBegin - 1)
for i := 0; i < workers; i++ {
	g.GoCtx(func(ctx context.Context) error {
		for {
			batch := int(atomic.AddInt64(&batchIdxAtomic, 1))
			if batch >= conf.BatchEnd {
				break
			}
			...
		}
		return nil
	})
}
Many of the workloads appear to do one row per batch, so I was trying to avoid any cross-thread coordination on each row, but in the bigger scheme of things a barrier is probably trivial. It might also be the case that the Batch func should return more than one row per call, which would probably have other benefits too (e.g. could bulk-alloc more).
Done.
pkg/ccl/importccl/read_import_workload.go, line 150 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
s/thread/goroutine/ everywhere in this PR
Done.
pkg/ccl/importccl/read_import_workload.go, line 152 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
move everything in this body into a method on (*workloadReader)
Done
Well, actually I introduced a new struct to hold the extra per-file args it needs to close over, and then a method on that.
pkg/ccl/importccl/read_import_workload.go, line 161 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
intentional?
Done.
danhhz
left a comment
Sorry if I'm being dense, but I don't see the benchmarks.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 146 at r3 (raw file):
Previously, dt (David Taylor) wrote…
Many of the workloads appear to do one row / batch so I was trying to avoid having to do any cross-thread coordination for each row, but I guess in the bigger scheme of things, a barrier is probably trivial. It might also be the case that we should be making the Batch func return more than 1 row per call and that would probably have other benefits too (e.g. could bulk-alloc more).
Done.
Yeah, I had the same thought, but it's easy to later make this grab more than 1 at a time if we have contention issues. (I suspect we won't for a while because the work of converting a row into kvs is much bigger than one atomic add.) Making batch funcs return more than one row is coming.
pkg/ccl/importccl/read_import_workload.go, line 142 at r4 (raw file):
workers := workloadReaderWorker{
	w: w,
let's denormalize whatever this needs from w into fields on this struct
pkg/ccl/importccl/read_import_workload.go, line 168 at r4 (raw file):
// in the SST builder, minimizing the amount of overlapping SSTs ingested.
func (w *workloadReaderWorker) run(ctx context.Context) error {
	evalCtx := w.w.newEvalCtx()
move this to where we instantiate the workloadReaderWorker
ugh, I realized I had accidentally detached HEAD when I was benchmarking, which is why the amend got lost. But apparently when I then pushed, it deleted the branch instead, and GitHub says it can't be reopened.
dt
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 142 at r4 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
let's denormalize whatever this needs from w into fields on this struct
happy to do so, but it uses all three fields of w
dt
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 168 at r4 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
move this to where we instantiate the workloadReaderWorker
so the func is the worker -- there is only one struct instantiated. On second thought, I'm going back to no helper struct -- it seems to complicate things more than just having an anonymous func inline in the readFile loop to close over all the per-file locals.
36106: importccl: parallelize CSV-skipping workload IMPORT r=dt a=dt
Reopening #36060 after accidentally deleting branch, after which GitHub said it cannot be re-opened :/
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
(only last commit, first two commits are #36042)
This spins up multiple workers, each importing every i'th batch, to do
workload IMPORT.
As noted inline, this execution order, as opposed to assigning large
spans of batches to each worker, should mean that adjacent batches are
processed at roughly the same time and thus end up in the same
sort-batch for SST creation, preserving the non-overlapping SSTs when
the workload's batches are ordered and non-overlapping.
Release note: none.