Skip to content

server, sql: Query cancellation for non-distributed queries#17003

Merged
itsbilal merged 4 commits intocockroachdb:masterfrom
itsbilal:query-cancellation
Jul 24, 2017
Merged

server, sql: Query cancellation for non-distributed queries#17003
itsbilal merged 4 commits intocockroachdb:masterfrom
itsbilal:query-cancellation

Conversation

@itsbilal
Copy link
Copy Markdown
Contributor

@itsbilal itsbilal commented Jul 12, 2017

Implements most of this RFC. Specifically:

  • Adds a new query ID, a uint128 generated from the HLC timestamp and encoded into a hex string. Shown in SHOW SESSIONS.
  • queryMeta gets cancellation flags which are checked repeatedly by planNodes (for most planNodes, that's every 100k rows processed).
  • Transaction state change statements (eg. COMMIT, ROLLBACK, BEGIN TRANSACTION) are not cancellable at any point. Every other statement is cancellable until its autocommit phase (if it has one).

@itsbilal itsbilal added this to the 1.1 milestone Jul 12, 2017
@itsbilal itsbilal requested a review from knz July 12, 2017 15:45
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@cuongdo
Copy link
Copy Markdown
Contributor

cuongdo commented Jul 13, 2017

cc @asubiotto who might have some thoughts


Review status: 0 of 41 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 13, 2017

  1. please split the commit into three, one that adds query IDs, one that implements the registry and inspection, and one that implements cancellation. Right now there are too many separate changes interleaved, I can't review as-is.
  2. see my comments about sort.go
  3. we need to talk about the "check interval" and whether it is useful if you end up not checking a context's Done method

Reviewed 16 of 41 files at r1.
Review status: 16 of 41 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/cancel_checker.go, line 63 at r1 (raw file):

	if !c.isCancelled && c.rowsSinceLastCheck%cancelCheckInterval == 0 {
		c.isCancelled = atomic.LoadInt32(&c.queryMeta.isCancelled) == 1

If you are using an atomic and there will be no context check, then there is also no need for a check interval.


pkg/sql/sqlbase/sort.go, line 24 at r1 (raw file):

import (
	"sort"

I am not OK with placing this code in the sqlbase package if it requires you to indirect the cancellation condition via an interface call.

Also, you used way too many calls to the cancel check method.


pkg/util/uint128/uint128_test.go, line 35 at r1 (raw file):

func TestString(t *testing.T) {
	s := "a95e31998f38490651c02b97c7f2acca"

Test the error cases too.


Comments from Reviewable

@itsbilal itsbilal force-pushed the query-cancellation branch 2 times, most recently from 3337fcd to 9d54697 Compare July 17, 2017 18:57
@itsbilal
Copy link
Copy Markdown
Contributor Author

@knz Did a split into 3 commits (slightly different from what you suggested, but it made the changes a lot easier to separate).


Review status: 14 of 39 files reviewed at latest revision, 3 unresolved discussions.


pkg/sql/cancel_checker.go, line 63 at r1 (raw file):

Previously, knz (kena) wrote…

If you are using an atomic and there will be no context check, then there is also no need for a check interval.

Adding a check for context.Done too now.


pkg/util/uint128/uint128_test.go, line 35 at r1 (raw file):

Previously, knz (kena) wrote…

Test the error cases too.

On it - will also add more tests in the next commit.


pkg/sql/sqlbase/sort.go, line 24 at r1 (raw file):
Moved it to sql and removed the interface.

Also, you used way too many calls to the cancel check method.

Not sure where specifically I'm checking too often. But I've removed one fairly-redundant check in heapSort and moved the check in quickSort to just before (not after) the recursive calls - it should space out the checks more evenly, time wise.


Comments from Reviewable

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 17, 2017

Thanks for all the good work.
Some comments below.
Also, throughout:

  1. we don't usually capitalize error messages because they are likely to be embedded in a larger string after a colon.
  2. don't store the cancelChecker references in the various planNodes. Instead change the prototype of Next() to pass the checker as argument. Better yet, make a new struct (nextParams or similar) and put both the context and the cancelChecker into that, and pass the struct by-value through the Next calls.
    (Why: we already know we need to pass more stuff through the Next recursion - just waiting for the right reason to make that change, it looks like your PR is just the reason we needed 😸 )

Reviewed 2 of 41 files at r1, 28 of 30 files at r2, 14 of 14 files at r3, 14 of 14 files at r4.
Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


pkg/server/serverpb/status.proto, line 268 at r2 (raw file):

// ActiveQuery represents a query in flight on some Session.
message ActiveQuery {
  // ID of the query (uint128 presented as a hexadecimal string).

Maybe this should be encoded as the 16 bytes of the uuid instead of the 32 bytes of the hex representation!


pkg/sql/cancel_checker.go, line 55 at r4 (raw file):

// check returns an error if the associated query has been cancelled.
func (c *cancelChecker) Check() error {
	// No need to check anything for internal planners.

Don't make the conditional like this. Instead use a do-nothing function as cancel checker for internal planners.

(Basic principle: don't pay N times for something you can check ahead of time just once)


pkg/sql/crdb_internal.go, line 559 at r2 (raw file):

const queriesSchemaPattern = `
CREATE TABLE crdb_internal.%s (
  id               STRING,         -- the cluster-unique ID of the query

s/id/query_id/


pkg/sql/delete.go, line 142 at r4 (raw file):

func (d *deleteNode) preFinalize() error {
	// Mark this query as non-cancellable.
	if d.p.autoCommit && d.p.stmt != nil {

Factor this code with the one in update.go in a single method called from both places.


pkg/sql/executor.go, line 1063 at r2 (raw file):

		queryMeta := &queryMeta{
			start: timeutil.Now(),

make the timestamp here and in phaseTimes be the same.


pkg/sql/executor.go, line 379 at r3 (raw file):

	}

	return false, fmt.Errorf("No query found for query ID %s", queryID)

"query ID %s not found"


pkg/sql/filter_opt.go, line 323 at r4 (raw file):

	case *alterTableNode:
	case *cancelQueryNode:

wrong commit, it seems


pkg/sql/insert.go, line 263 at r4 (raw file):

		if err == nil {
			// Mark this query as non-cancellable
			if n.p.autoCommit && n.p.stmt != nil {

ditto update/delete


pkg/sql/session.go, line 645 at r2 (raw file):

func (s *Session) addActiveQuery(queryID string, queryMeta *queryMeta) {
	s.mu.Lock()
	defer s.mu.Unlock()

Out of curiosity why add a defer here?


pkg/sql/truncate.go, line 96 at r4 (raw file):

	}

	// Mark statement as non-cancellable.

Explain why.


pkg/sql/update.go, line 413 at r4 (raw file):

			if u.p.autoCommit && u.p.stmt != nil {
				queryMeta := u.p.stmt.queryMeta
				if err := queryMeta.setNonCancellable(); err != nil {

I do not understand why you are marking the query non-cancellable only here. What is the motivation? The comment must be extended.


Comments from Reviewable

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 17, 2017

I forgot to ask, you also will need to compute before/after perf measurements before this PR can merge.
Use queries with lots of result rows, of different kinds (unique/order by/etc)

@itsbilal itsbilal force-pushed the query-cancellation branch 3 times, most recently from 533e9a3 to bd4f9fd Compare July 18, 2017 21:45
@itsbilal
Copy link
Copy Markdown
Contributor Author

Made that refactor of planNode.Next() as you suggested. Con: the diff is now huge.

I've yet to address some of your minor comments about the insert/update/delete refactor and the uncancellable comment (and I'll get to both of those very shortly), but otherwise everything else should be good.


Review status: 9 of 65 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


pkg/server/serverpb/status.proto, line 268 at r2 (raw file):

Previously, knz (kena) wrote…

Maybe this should be encoded as the 16 bytes of the uuid instead of the 32 bytes of the hex representation!

Do you mean just replacing it with bytes that has uint128 as the gogoproto.customtype? That would reduce the size of the object sent over the wire, but it'd move the byte-to-string marshaling to the node where ListSessions or ListQueries is running. Would it still be worth?


pkg/sql/cancel_checker.go, line 55 at r4 (raw file):

Previously, knz (kena) wrote…

Don't make the conditional like this. Instead use a do-nothing function as cancel checker for internal planners.

(Basic principle: don't pay N times for something you can check ahead of time just once)

Done. Brought back the interface that has two implementations - one that does nothing (for internal planners).


pkg/sql/crdb_internal.go, line 559 at r2 (raw file):

Previously, knz (kena) wrote…

s/id/query_id/

Done.


pkg/sql/executor.go, line 1063 at r2 (raw file):

Previously, knz (kena) wrote…

make the timestamp here and in phaseTimes be the same.

The closest past phaseTimes timestamp at this point would be sessionEndParse - which isn't the best thing to intuitively have this match to. Either way, I've made the change - queryMeta.start = session.phaseTimes[sessionEndParse]


pkg/sql/executor.go, line 379 at r3 (raw file):

Previously, knz (kena) wrote…

"query ID %s not found"

Done.


pkg/sql/session.go, line 645 at r2 (raw file):

Previously, knz (kena) wrote…

Out of curiosity why add a defer here?

I could just Unlock it after setting the map, but it's basically the same thing. I thought defers were preferable just in case the function gets an early return down the line.


pkg/sql/update.go, line 413 at r4 (raw file):

Previously, knz (kena) wrote…

I do not understand why you are marking the query non-cancellable only here. What is the motivation? The comment must be extended.

The goal is to mark all entries to the autoCommit path as uncancellable, as we talked about in the RFC. In this case, tw.finalize() will send out the EndTransaction request. Will expand the comment in the next update.


pkg/util/uint128/uint128_test.go, line 35 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

On it - will also add more tests in the next commit.

Done.


Comments from Reviewable

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 18, 2017

Thank you for the effort. I ❤️ this PR.

I also just saw the light:

  • put the cancelchecker in the planner
  • pass the planner reference as a member of the (new) nextParams struct
  • pass nextParams to Start() too
  • remove the existing planner reference also stored in the planNodes! Because now each Next/Start method that needs it can pull it from the nextParams.

If you're lucky your PR would perhaps even get a negative line count :)


Reviewed 1 of 30 files at r2, 1 of 29 files at r5, 10 of 14 files at r6, 42 of 42 files at r7, 3 of 3 files at r8.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


pkg/server/serverpb/status.proto, line 268 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Do you mean just replacing it with bytes that has uint128 as the gogoproto.customtype? That would reduce the size of the object sent over the wire, but it'd move the byte-to-string marshaling to the node where ListSessions or ListQueries is running. Would it still be worth?

hmm maybe not. Let this sleep.


pkg/sql/create.go, line 701 at r8 (raw file):

		// method since CREATE TABLE is a DDL statement and Executor only
		// runs Next() for statements with type "Rows".
		nextParams := nextParams{

Make Start() take the params as argument too. Construct the cancel checker in the executor.


pkg/sql/information_schema.go, line 810 at r8 (raw file):

		return err
	}
	nextParams := nextParams{

I would make the planner{} object contain the cancelchecker, so as to not have to construct it anew everywhere it's needed.


pkg/sql/join.go, line 319 at r8 (raw file):

	// Load all the rows from the right side and build our hashmap.
	acc := n.bucketsMemAcc.Wtxn(n.planner.session)
	nextParams := nextParams{

see my other comment about setting up the cancelchecker in the planner directly.


pkg/sql/session.go, line 645 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I could just Unlock it after setting the map, but it's basically the same thing. I thought defers were preferable just in case the function gets an early return down the line.

yeah but a defer has a non-trivial cost. That's really why we don't use defer everywhere already.


pkg/sql/update.go, line 413 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

The goal is to mark all entries to the autoCommit path as uncancellable, as we talked about in the RFC. In this case, tw.finalize() will send out the EndTransaction request. Will expand the comment in the next update.

This still needs to be documented in the code.

Also here and in other places this logic must be captured in a single method reused everywhere where this is relevant.
That method would then carry the motivating comment.


Comments from Reviewable

@itsbilal
Copy link
Copy Markdown
Contributor Author

Addressed almost all of your comments, and started passing nextParams to Start() as well (though it should probably get a better name at this point - planNodeParams? planParams?). Passing the planner instead of cancelChecker is the only remaining TODO (there would be some cases, eg. in distsql_physical_planner where it will be null).

Technically, if we wanted to put the cancelChecker inside the planner in the first place, we didn't need any of the planNode refactors (since the relevant planNodes already have a planner reference) - but nextParams seems like a cleaner approach forward anyway.


Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


pkg/sql/create.go, line 701 at r8 (raw file):

Previously, knz (kena) wrote…

Make Start() take the params as argument too. Construct the cancel checker in the executor.

Done.


pkg/sql/delete.go, line 142 at r4 (raw file):

Previously, knz (kena) wrote…

Factor this code with the one in update.go in a single method called from both places.

Done.


pkg/sql/information_schema.go, line 810 at r8 (raw file):

Previously, knz (kena) wrote…

I would make the planner{} object contain the cancelchecker, so as to not have to construct it anew everywhere it's needed.

Done.


pkg/sql/insert.go, line 263 at r4 (raw file):

Previously, knz (kena) wrote…

ditto update/delete

Done.


pkg/sql/join.go, line 319 at r8 (raw file):

Previously, knz (kena) wrote…

see my other comment about setting up the cancelchecker in the planner directly.

Done.


pkg/sql/session.go, line 645 at r2 (raw file):

Previously, knz (kena) wrote…

yeah but a defer has a non-trivial cost. That's really why we don't use defer everywhere already.

Done.


pkg/sql/truncate.go, line 96 at r4 (raw file):

Previously, knz (kena) wrote…

Explain why.

Done.


pkg/sql/update.go, line 413 at r4 (raw file):

Previously, knz (kena) wrote…

This still needs to be documented in the code.

Also here and in other places this logic must be captured in a single method reused everywhere where this is relevant.
That method would then carry the motivating comment.

Done.


Comments from Reviewable

@itsbilal itsbilal force-pushed the query-cancellation branch from bd4f9fd to 11388f2 Compare July 19, 2017 20:34
@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 19, 2017

nextParams -> runParams or execParams?

Remember once the planner is in the nextParams/runParams/execParams you can drop the planner reference from all the planNodes! I am looking forward to this code reduction.


Reviewed 2 of 57 files at r9, 8 of 14 files at r10, 44 of 44 files at r11, 3 of 3 files at r12.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable

@itsbilal itsbilal force-pushed the query-cancellation branch 2 times, most recently from 2b9c555 to d332315 Compare July 20, 2017 18:59
@itsbilal
Copy link
Copy Markdown
Contributor Author

Renamed nextParams to runParams. Dropped all the planner references that could be dropped; didn't drop in the few cases where the planNode's planner reference was being used far away from Next() and Start().

We also talked yesterday about completely removing the uncancellable phase (instead opting to leave the outcome of the CANCEL QUERY ambiguous in all phases, which it already was in the cancellable phase anyway). The rationale for this would be to simplify the codebase, to prevent an implicit perception of a guarantee that the query is cancellable in the pre-non-cancellable phase, and also to prevent the user from getting the notion that the query has been committed (which it may not be).

Making those changes next.


Review status: 6 of 65 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable

@itsbilal itsbilal force-pushed the query-cancellation branch from d332315 to 07f920f Compare July 20, 2017 21:04
@itsbilal
Copy link
Copy Markdown
Contributor Author

Non-cancellable phase is gone, and the RFC has been updated to reflect that.


Review status: 5 of 65 files reviewed at latest revision, 1 unresolved discussion.


Comments from Reviewable

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 20, 2017

:lgtm:

Please explain what you did to Andrew, he's cooking a change to the table writer's finalize method and he needs to learn about the cancelChecker.


Reviewed 1 of 59 files at r13, 12 of 18 files at r14, 44 of 44 files at r15, 3 of 3 files at r16.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@itsbilal itsbilal force-pushed the query-cancellation branch from 07f920f to d41728e Compare July 21, 2017 19:10
@itsbilal itsbilal force-pushed the query-cancellation branch from d41728e to ad67677 Compare July 24, 2017 13:14
@itsbilal itsbilal force-pushed the query-cancellation branch from ad67677 to 192658a Compare July 24, 2017 13:40
@itsbilal itsbilal merged commit 3f9d7e8 into cockroachdb:master Jul 24, 2017
@itsbilal itsbilal deleted the query-cancellation branch July 24, 2017 13:59
@andreimatei
Copy link
Copy Markdown
Contributor

I'm super excited to see this happening. Can't wait to see the code.

Here's a drive-by feature request that I wanna see if I can get you interested in: when a query processed through the parallelizeQueue returns an error, we should cancel all the other concurrent queries on the session. This came to my mind when writing up #17197.


Review status: 4 of 65 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable

itsbilal added a commit to itsbilal/cockroach that referenced this pull request Jul 27, 2017
This reverts some of the changes made in PRs cockroachdb#17003 and cockroachdb#17200. In
particular:

17003 made new contexts for every statement, but didn't cancel them
after statement execution, resulting in a memory leak.
17200 did a quick-fix for that memory leak by not forking that context,
at the expense of not being able to leverage the context in
cancellation.

This PR restores the context cancellation behaviour introduced in cockroachdb#17003
, except at the transaction level - cancelling a query closes its
transaction's context.
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Jul 27, 2017
This reverts some of the changes made in PRs cockroachdb#17003 and cockroachdb#17200. In
particular:

17003 made new contexts for every statement, but didn't cancel them
after statement execution, resulting in a memory leak.
17200 did a quick-fix for that memory leak by not forking that context,
at the expense of not being able to leverage the context in
cancellation.

This PR restores the context cancellation behaviour introduced in cockroachdb#17003
, except at the transaction level - cancelling a query closes its
transaction's context.
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Jul 31, 2017
This reverts some of the changes made in PRs cockroachdb#17003 and cockroachdb#17200. In
particular:

17003 made new contexts for every statement, but didn't cancel them
after statement execution, resulting in a memory leak.
17200 did a quick-fix for that memory leak by not forking that context,
at the expense of not being able to leverage the context in
cancellation.

This PR restores the context cancellation behaviour introduced in cockroachdb#17003
, except at the transaction level - cancelling a query closes its
transaction's context.
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Aug 1, 2017
This reverts some of the changes made in PRs cockroachdb#17003 and cockroachdb#17200. In
particular:

17003 made new contexts for every statement, but didn't cancel them
after statement execution, resulting in a memory leak.
17200 did a quick-fix for that memory leak by not forking that context,
at the expense of not being able to leverage the context in
cancellation.

This PR restores the context cancellation behaviour introduced in cockroachdb#17003
, except at the transaction level - cancelling a query closes its
transaction's context.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants