Skip to content

sql: use distsql for local distinct#25873

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
rjnn:distinct_local_pr
Jun 12, 2018
Merged

sql: use distsql for local distinct#25873
craig[bot] merged 2 commits intocockroachdb:masterfrom
rjnn:distinct_local_pr

Conversation

@rjnn
Copy link
Copy Markdown
Contributor

@rjnn rjnn commented May 23, 2018

Use the distsql distinct processor for processing local sql queries. This is a work-in-progress to demonstrate and begin a discussion around the shims used.

This PR is rebased on top of #25860, so only evaluate the last commit. The main interfaces on which I'd like discussion are RowSourceToPlanNode and PlanNodeToRowSource. The rest of this PR is exporting changes (2nd commit), some planning plumbing, and code removal in local distinct.

@rjnn rjnn requested review from a team and jordanlewis May 23, 2018 20:15
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

I've only had time to skim this so far. Replacing localsql distinct with distsql distinct wasn’t as involved as I expected. It’s worthwhile to see if there is any performance diff from the change (i.e. due to the glue code).


Review status: 0 of 19 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@rjnn
Copy link
Copy Markdown
Contributor Author

rjnn commented May 24, 2018

Yup, this was fairly quick work, until I got to tuples!

There will be performance penalty due to the useless EncDatum wrapper which will churn through additional allocations, but I'll run some actual benchmarks to give you actual numbers.


Comments from Reviewable

@RaduBerinde
Copy link
Copy Markdown
Member

I like the approach, looks pretty straightforward.


Review status: 0 of 19 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/sql/distsql_physical_planner.go, line 2196 at r3 (raw file):

}

func CreateDistinctSpec(n *distinctNode, cols []int) *distsqlrun.DistinctSpec {

unexport


pkg/sql/plan_node_to_row_source.go, line 11 at r3 (raw file):

)

type planNodeToRowSource struct {

What is this to be used for?


pkg/sql/row_source_to_plan_node.go, line 18 at r3 (raw file):

// be constructed with Create(), after which it is a PlanNode and can be treated
// as such.
type RowSourceToPlanNode struct {

Does this need to be exported?


pkg/sql/row_source_to_plan_node.go, line 50 at r3 (raw file):

	}

	if len(r.row) == 0 {

The check should be for r.row == nil. Processors can in principle output empty rows.


pkg/sql/row_source_to_plan_node.go, line 51 at r3 (raw file):

	if len(r.row) == 0 {
		r.source.ConsumerClosed()

I think ConsumerClosed is to be called only once (at least according to its comment), it will get called again in Close


pkg/sql/row_source_to_plan_node.go, line 59 at r3 (raw file):

func (r *RowSourceToPlanNode) Values() tree.Datums {
	datums := make(tree.Datums, len(r.row))

This can be pre-allocated once (most planNodes do this)


pkg/util/fast_int_set.go, line 196 at r3 (raw file):

	}
	if s.large != nil {
		panic("dont do this")

This panic is not acceptable. We sometimes run tests with the "small set" cutoff set to a very small value to check there are no problems when falling back to large. The code below should work fine in the large mode.


Comments from Reviewable

@asubiotto
Copy link
Copy Markdown
Contributor

Nice work. Have you got a rough plan for the long term merge of the two execution engines? I'm still kind of cloudy on the details but wondering if this work has made the path clearer for you.


Review status: 0 of 19 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


pkg/sql/distinct.go, line 217 at r3 (raw file):

	post := &distsqlrun.PostProcessSpec{} // TODO? I think an empty PostProcessSpec is fine.
	var output distsqlrun.RowReceiver     // TODO

Comment explaining that output will not be used because distinct will be run as a RowSource


pkg/sql/plan_node_to_row_source.go, line 58 at r3 (raw file):

	}
	if !valid {
		p.valid = false

Where is p.valid used?


pkg/sql/plan_node_to_row_source.go, line 66 at r3 (raw file):

}

func (p *planNodeToRowSource) ConsumerDone() {}

I think this currently violates the RowSource contract because you would still be returning rows after ConsumerDone is called. A call to this should also result in a planNode.Close at some point. This usually happens after sending any trailing metadata, but in this case it should be immediate.


pkg/sql/plan_node_to_row_source.go, line 72 at r3 (raw file):

}

func (p *planNodeToRowSource) DatumsToEncDatumRow(values tree.Datums) sqlbase.EncDatumRow {

This might be a nice utility function to have in sqlbase (I think similar logic is used in the memRowContainer)


pkg/sql/row_source_to_plan_node.go, line 27 at r3 (raw file):

}

func CreateRowSourceToPlanNode(s distsqlrun.RowSource) RowSourceToPlanNode {

s/CreateRowSourceToPlanNode/MakeRowSourceToPlanNode and ditto in PlanNodeToRowSource


pkg/sql/row_source_to_plan_node.go, line 35 at r3 (raw file):

func (r *RowSourceToPlanNode) Next(params runParams) (bool, error) {
	if r.source == nil {
		return false, errors.Errorf("no valid rowsource initialized")

I think it's fine to assume that r.source != nil in Next, I believe we do this for processors too


pkg/sql/distsqlrun/base.go, line 155 at r3 (raw file):

type RowSourcedProcessor interface {
	RowSource
	Run(_ context.Context, wg *sync.WaitGroup)

Any reason not to embed the Processor interface as well?


Comments from Reviewable

@rjnn rjnn force-pushed the distinct_local_pr branch 3 times, most recently from 479c52f to a9efe37 Compare June 4, 2018 19:54
@rjnn rjnn requested a review from a team June 4, 2018 19:54
@rjnn rjnn force-pushed the distinct_local_pr branch 4 times, most recently from 9b7d2a6 to 4c49d49 Compare June 4, 2018 21:31
@rjnn
Copy link
Copy Markdown
Contributor Author

rjnn commented Jun 4, 2018

@asubiotto I updated #23901 with an extended plan.


Review status: 0 of 26 files reviewed at latest revision, 12 unresolved discussions, some commit checks pending.


pkg/sql/distinct.go, line 217 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Comment explaining that output will not be used because distinct will be run as a RowSource

Great point, thank you. Done.


pkg/sql/distsql_physical_planner.go, line 2196 at r3 (raw file):

Previously, RaduBerinde wrote…

unexport

Done.


pkg/sql/plan_node_to_row_source.go, line 11 at r3 (raw file):

Previously, RaduBerinde wrote…

What is this to be used for?

Eventually when we have multiple planNodes that are rowSourced, we'd want to fuse their execution pipeline and then unfuse them at the end. Right now we don't, so this is not used code. I can leave it out of this PR if you prefer.


pkg/sql/plan_node_to_row_source.go, line 58 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Where is p.valid used?

Huh, I was never using it. You're right, since the underlying RowSources track it internally, there's no need for us to redo that work. Removed.


pkg/sql/plan_node_to_row_source.go, line 66 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I think this currently violates the RowSource contract because you would still be returning rows after ConsumerDone is called. A call to this should also result in a planNode.Close at some point. This usually happens after sending any trailing metadata, but in this case it should be immediate.

Added a comment that the caller should close after calling this.


pkg/sql/plan_node_to_row_source.go, line 72 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

This might be a nice utility function to have in sqlbase (I think similar logic is used in the memRowContainer)

Done. Moved to sqlbase/encoded_datum.go. Feel free to use it!


pkg/sql/row_source_to_plan_node.go, line 18 at r3 (raw file):

Previously, RaduBerinde wrote…

Does this need to be exported?

No. Unexported.


pkg/sql/row_source_to_plan_node.go, line 27 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

s/CreateRowSourceToPlanNode/MakeRowSourceToPlanNode and ditto in PlanNodeToRowSource

Done, and also unexported.


pkg/sql/row_source_to_plan_node.go, line 35 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I think it's fine to assume that r.source != nil in Next, I believe we do this for processors too

Done. Removed. Thanks.


pkg/sql/row_source_to_plan_node.go, line 50 at r3 (raw file):

Previously, RaduBerinde wrote…

The check should be for r.row == nil. Processors can in principle output empty rows.

Good point. Thank you.


pkg/sql/row_source_to_plan_node.go, line 51 at r3 (raw file):

Previously, RaduBerinde wrote…

I think ConsumerClosed is to be called only once (at least according to its comment), it will get called again in Close

Done.


pkg/sql/row_source_to_plan_node.go, line 59 at r3 (raw file):

Previously, RaduBerinde wrote…

This can be pre-allocated once (most planNodes do this)

Done.


pkg/sql/distsqlrun/base.go, line 155 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Any reason not to embed the Processor interface as well?

You can't embed both RowSource and Processor, as you get an "error: duplicate method OutputTypes". The intention here is to have a union interface that satisfies both interfaces, but this is the only way I could get it to not complain. Is there a better way to achieve this?


pkg/util/fast_int_set.go, line 196 at r3 (raw file):

Previously, RaduBerinde wrote…

This panic is not acceptable. We sometimes run tests with the "small set" cutoff set to a very small value to check there are no problems when falling back to large. The code below should work fine in the large mode.

I ended up removing the whole function, it was just for rapid prototyping, and thus the code quality.


Comments from Reviewable

@rjnn rjnn changed the title DONOTMERGE sql: use distsql for local distinct sql: use distsql for local distinct Jun 4, 2018
@rjnn rjnn force-pushed the distinct_local_pr branch from 4c49d49 to 0afe295 Compare June 4, 2018 22:01
@RaduBerinde
Copy link
Copy Markdown
Member

Review status: 0 of 10 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/sql/plan_node_to_row_source.go, line 11 at r3 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

Eventually when we have multiple planNodes that are rowSourced, we'd want to fuse their execution pipeline and then unfuse them at the end. Right now we don't, so this is not used code. I can leave it out of this PR if you prefer.

Not sure I understand. Why would we need a row-source - to - planNode shim to fuse them? Anyway, I'd leave it out of this PR since it's just a distraction for now.


Comments from Reviewable

@rjnn rjnn force-pushed the distinct_local_pr branch from 0afe295 to c1e1c5a Compare June 7, 2018 19:55
Prerequisite before embedding it in the planNodes in Local SQL.

Release note: None
@jordanlewis
Copy link
Copy Markdown
Member

👍 this looks good!


Review status: 0 of 8 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


pkg/sql/plan_node_to_row_source.go, line 11 at r3 (raw file):

Previously, RaduBerinde wrote…

Not sure I understand. Why would we need a row-source - to - planNode shim to fuse them? Anyway, I'd leave it out of this PR since it's just a distraction for now.

We do need both directions of conversion, at least at first. If there are 3 planNodes in a row, A->B->C, and B is implemented as a wrapper around a Processor, you need a rowSourceToPlanNode wrapper to wrap the processor so that A can consume from B as a planNode, and you need a planNodeToRowSource wrapper to wrap the planNode C so that B's inner processor can consume from C as a RowSource.


Comments from Reviewable

@rjnn
Copy link
Copy Markdown
Contributor Author

rjnn commented Jun 11, 2018

:lgtm: but you know... Mine doesn't count.


Comments from Reviewable

@asubiotto
Copy link
Copy Markdown
Contributor

:lgtm: just some nits. The fact that ConsumerClosed and ConsumerDone are noops in planNodeToRowSource confuses me a bit so it might be worth writing a more extensive comment but otherwise looks good.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/distinct.go, line 211 at r8 (raw file):

	post := &distsqlrun.PostProcessSpec{} // post is not used as we only use the processor for the core distinct logic.
	var output distsqlrun.RowReceiver     // output is never used as distinct is only run as a RowSource.

nit: I personally find it easier to read comments when they are on the line above, but feel free to ignore


pkg/sql/plan_node_to_row_source.go, line 66 at r3 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

Added a comment that the caller should close after calling this.

I'm kind of confused. I think we need to implement RowSource contracts in this wrapper, this includes making ConsumerClosed actually close the planNode and have ConsumerDone behave how we expect it to, because the caller (in this case a distsql processor) doesn't know about the underlying RowSource implementation and expects


pkg/sql/row_source_to_plan_node.go, line 81 at r8 (raw file):

func (r *rowSourceToPlanNode) Close(ctx context.Context) {
	if r.source != nil {

nit: We can assume r.source is != nil here as well


pkg/sql/row_source_to_plan_node.go, line 86 at r8 (raw file):

}

var _ planNode = &rowSourceToPlanNode{}

nit: I would put this right below the struct definition


pkg/sql/distsqlrun/base.go, line 155 at r3 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

You can't embed both RowSource and Processor, as you get an "error: duplicate method OutputTypes". The intention here is to have a union interface that satisfies both interfaces, but this is the only way I could get it to not complain. Is there a better way to achieve this?

This is fine


Comments from Reviewable

@asubiotto
Copy link
Copy Markdown
Contributor

Review status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/plan_node_to_row_source.go, line 66 at r3 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I'm kind of confused. I think we need to implement RowSource contracts in this wrapper, this includes making ConsumerClosed actually close the planNode and have ConsumerDone behave how we expect it to, because the caller (in this case a distsql processor) doesn't know about the underlying RowSource implementation and expects

^ disregard this unfinished comment, I thought I had deleted it. I'm still confused but it seems that implementing this logic even with removing the planNode.Close call in distinct results in some Close issues. A comment more extensively explaining this interaction would satisfy me.


Comments from Reviewable

@jordanlewis
Copy link
Copy Markdown
Member

I fixed the flow so that they're no longer no-ops. It makes more sense now. PTAL.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


Comments from Reviewable

@asubiotto
Copy link
Copy Markdown
Contributor

:lgtm:


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


Comments from Reviewable

Use the distsql distinct processor for processing local sql
queries.

Closes cockroachdb#23901.

Release note: none
@jordanlewis
Copy link
Copy Markdown
Member

bors r+

craig bot pushed a commit that referenced this pull request Jun 12, 2018
25873: sql: use distsql for local distinct r=jordanlewis a=arjunravinarayan

Use the distsql distinct processor for processing local sql queries. This is a work-in-progress to demonstrate and begin a discussion around the shims used.

This PR is rebased on top of #25860, so only evaluate the last commit. The main interfaces on which I'd like discussion are RowSourceToPlanNode and PlanNodeToRowSource. The rest of this PR is exporting changes (2nd commit), some planning plumbing, and code removal in local distinct.

Co-authored-by: Arjun Narayan <arjun@cockroachlabs.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 12, 2018

Build succeeded

@craig craig bot merged commit c74fb21 into cockroachdb:master Jun 12, 2018
@rjnn rjnn deleted the distinct_local_pr branch September 10, 2018 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants