
storage/concurrency: randomized test for lockTable with concurrency#44791

Merged
craig[bot] merged 1 commit into cockroachdb:master from sumeerbhola:locktable
Feb 9, 2020

Conversation

@sumeerbhola
Collaborator

and additional datadriven tests

There are two randomized tests:

  • with only non-transactional requests
  • with transactional requests. This can result in deadlock,
    so the test includes simple cycle detection and aborting
    of requests.

This includes fixes for bugs found by these tests:

  • Two nil pointer dereferences for non-transactional requests.
  • The distinguished waiter was not being updated correctly
    when a reservation was acquired on a free lock -- this
    could cause a different request from the same transaction
    to be a distinguished waiter which triggered a later
    invariant to be broken and a "bug" error to be returned.
    Found by the randomized test, and there is now a datadriven
    test for it.
  • Small design flaw: A reservation held by a request can be broken
    while that request is holding latches and has proceeded
    to evaluation. This is harmless since the request that
    has broken the reservation cannot evaluate yet since it
    is not holding latches, but it tripped up an invariant
    in the lock acquire path which checked that the reservation
    was not held by some other request.
    Found by the randomized test, and there is now a datadriven
    test for it.

Release note: None
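The "simple cycle detecting and aborting of requests" that the commit message mentions can be sketched as a depth-first search over a waits-for graph. This is a hedged illustration, not the actual test code from this PR: the names `waitsFor` and `hasCycleFrom` are hypothetical.

```go
package main

import "fmt"

// waitsFor maps each transaction ID to the transactions it is waiting on.
// A cycle in this graph is a deadlock; the randomized test would abort one
// of the transactions on the cycle to break it.
type waitsFor map[string][]string

// hasCycleFrom reports whether a cycle is reachable from start, using a
// depth-first search with an on-stack set.
func hasCycleFrom(g waitsFor, node string, onStack map[string]bool) bool {
	if onStack[node] {
		return true // we walked back into a node on the current path
	}
	onStack[node] = true
	for _, next := range g[node] {
		if hasCycleFrom(g, next, onStack) {
			return true
		}
	}
	delete(onStack, node) // backtrack
	return false
}

func main() {
	g := waitsFor{
		"txn1": {"txn2"}, // txn1 waits for a lock held by txn2
		"txn2": {"txn1"}, // txn2 waits for a lock held by txn1
	}
	fmt.Println(hasCycleFrom(g, "txn1", map[string]bool{})) // deadlock
	delete(g, "txn2")                                       // txn2 aborted
	fmt.Println(hasCycleFrom(g, "txn1", map[string]bool{}))
}
```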

@sumeerbhola sumeerbhola requested a review from nvb February 6, 2020 02:45
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@nvb nvb left a comment


Small design flaw: A reservation held by a request can be broken while that request is holding latches and has proceeded to evaluation. This is harmless since the request that has broken the reservation cannot evaluate yet since it is not holding latches, but it tripped up an invariant in the lock acquire path which checked that the reservation was not held by some other request.

This is interesting. Up to this point, I had been under the impression that to break a reservation, a transaction would need to be holding latches. I think this was because I was working under the outdated idea that a transaction would enter all lock wait-queues before releasing latches and only then begin waiting on the queues. During this initial scan, while still holding latches, a transaction would break other reservations where necessary. We decided that doing this compromised performance for some fairness and that it wasn't worth it. Should we reconsider?

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/concurrency/lock_table.go, line 1098 at r1 (raw file):

		return nil, nil
	}
	// Not already held, so may be reserved by this request. There is also the

The invariant that any lock acquisition would be performed by the reservation holder seemed very nice. I wonder if there's any way we could get it back.


pkg/storage/concurrency/lock_table_test.go, line 470 at r1 (raw file):

					panic(err)
				}
				if state.stateKind == doneWaiting {

nit: we usually use switch statements when matching on enums in Go.


pkg/storage/concurrency/lock_table_test.go, line 571 at r1 (raw file):

numConccViolations

numConcViolations?


pkg/storage/concurrency/lock_table_test.go, line 588 at r1 (raw file):

	err := e.lt.acquireLock(k, lock.Exclusive, lock.Unreplicated, g)
	if err != nil {
		fmt.Printf("locktable: %v\n", e.lt)

Did you mean to leave this in?


pkg/storage/concurrency/lock_table_test.go, line 589 at r1 (raw file):

	if err != nil {
		fmt.Printf("locktable: %v\n", e.lt)
		panic(err)

I'd suggest propagating these errors up to the main testing Goroutine using an errgroup.Group and then calling t.Fatal. Panicking in tests can get messy. It leaks into other tests in the same package and isn't always handled well.
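The pattern suggested here can be sketched with only the standard library (`errgroup.Group` from golang.org/x/sync does the same with less code). The `run` helper and the worker closures below are hypothetical stand-ins, not code from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// run mimics what errgroup.Group does: worker errors are collected and
// returned to the caller instead of panicking, so the main test goroutine
// can call t.Fatal with the first failure.
func run(workers ...func() error) error {
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for _, w := range workers {
		wg.Add(1)
		go func(w func() error) {
			defer wg.Done()
			if err := w(); err != nil {
				once.Do(func() { firstErr = err }) // keep only the first error
			}
		}(w)
	}
	wg.Wait() // happens-before edge makes reading firstErr safe
	return firstErr
}

func main() {
	err := run(
		func() error { return nil },
		func() error { return errors.New("lockTable bug") },
	)
	fmt.Println(err)
}
```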


pkg/storage/concurrency/lock_table_test.go, line 600 at r1 (raw file):

}

func (e *workloadExecutor) findCycle(node *uuid.UUID, cycleNode *uuid.UUID) bool {

nit: why the pointers?


pkg/storage/concurrency/lock_table_test.go, line 616 at r1 (raw file):

}

func (e *workloadExecutor) waitingFor(waiter uuid.UUID, lastID uuid.UUID, currID uuid.UUID) bool {

Could you give this function a comment?


pkg/storage/concurrency/lock_table_test.go, line 649 at r1 (raw file):

}

func (e *workloadExecutor) tryFinishTxn(txnID uuid.UUID, tstate *transactionState) bool {

What is the return value here?


pkg/storage/concurrency/lock_table_test.go, line 662 at r1 (raw file):

// When executing in non-strict mode, the concurrency bound is not necessarily
// respected since requests may be waiting for locks to be released.

Could you add a bit more about what strict means?


pkg/storage/concurrency/lock_table_test.go, line 684 at r1 (raw file):

			case <-timer.C:
				if strictIter {
					panic(fmt.Errorf("timer expired with lock table: %v", e.lt))

Do you have any concerns about this flaking when a test machine is under heavy load?


pkg/storage/concurrency/lock_table_test.go, line 747 at r1 (raw file):

// Randomized test with each transaction having a single request that does not
// acquire locks.

Add here that by not acquiring locks, we never create situations that would cause deadlocks.


pkg/storage/concurrency/lock_table_test.go, line 797 at r1 (raw file):

			startedTxnIDs = append(startedTxnIDs, txnMeta.ID)
		}
		for len(startedTxnIDs) > maxStartedTxns {

Consider making this random between [0, maxStartedTxns) so that we do occasionally drain back down to 0.


pkg/storage/concurrency/lock_table_test.go, line 833 at r1 (raw file):

	const numActiveTxns = 8
	var activeTxns [numActiveTxns]*enginepb.TxnMeta
	accesses := [2]spanset.SpanAccess{spanset.SpanReadOnly, spanset.SpanReadWrite}

nit: you could also just cast a 0 or 1 to spanset.SpanAccess. Or even better, acc := spanset.SpanAccess(rng.Intn(int(spanset.NumSpanAccess)))


pkg/storage/concurrency/lock_table_test.go, line 862 at r1 (raw file):

		keysPerm := rng.Perm(len(keys))
		spans := &spanset.SpanSet{}
		onlyReads := txnMeta == nil

I'm realizing now that we do have non-transactional write requests, but these requests just can't acquire locks. Instead, they just write directly to an MVCC timestamp.

Do we handle that correctly in the lockTable? I think we'd need to avoid adding them to the queuedWriters list, which indicates that the naming of queuedWriters and waitingReaders may not be quite right.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


The following breaking scenario would happen even if a request entered all current queues:

  • req1 writes to A and B and lock A is held. Since the only queue is at A it enqueues there.
  • Another request writes to B, acquires the latch at B, finds no queue there and acquires the lock. I am assuming we don't want to look at all waiting requests to see if they should be added to the queue at B.
  • req2 writes to B and enters the queue at B.
  • lock at B is released. req2 gets the reservation.
  • req2 reacquires latch B and does scan and is told it can evaluate. It is still holding latch B.
  • someone acquires latch A and releases lock A.
  • req1 breaks the reservation of req2 at B.
  • req2 tries to acquire lock B.

To prevent something like the above we could support queueing on spans (that have no current queue) and make ordering be strictly fair -- ignoring the complexity introduced by the former, I worry that the latter may lose us too much concurrency.

Another possibility I considered was for the scan done by req2 while holding latches to change its reservations to a "hardened" state that cannot be broken. But that is more state keeping (so more possibilities of bugs if forgetting to change something) and requires iterating twice over the lockState objects -- first to notice that all locks are reserved and then to change them to hardened. The fix here was simpler.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/concurrency/lock_table_test.go, line 862 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm realizing now that we do have non-transactional write requests, but these requests just can't acquire locks. Instead, they just write directly to an MVCC timestamp.

Do we handle that correctly in the lockTable? I think we'd need to avoid adding them to the queuedWriters list, which indicates that the naming of queuedWriters and waitingReaders may not be quite right.

Presumably these requests when evaluating will (a) discover an intent, (b) if there is no intent make sure to not violate monotonicity of timestamps?
We do not handle this correctly in the lockTable -- the current assumption is that any request that has a SpanReadWrite has a non-nil TxnMeta.

The ordering can be broken down into:

  • does it wait for lock holder: it must
  • does it acquire a reservation: No, since one can construct a deadlock scenario where txn1 holds lock A, req-nil is waiting for it, and at lock B req-nil has the reservation and a different request from txn1 is waiting for it.
  • does it wait in the queue for its turn and does it wait for the reservation holder: We could make it wait on the side and ignore reservation holders, like readers, but the main reason for not doing this for readers is that tracking who they should push became harder (they may conflict with someone ahead of them in the queue and not the reservation holder). That reason does not apply here, so another possibility is to make it queue up, and when it gets to the front it can proceed to evaluation. Additionally making it wait for the reservation holder is messier since it may have a lower sequence number than the reservation holder.

Do you have an opinion? Otherwise I can document the possibilities in a code comment and just do whatever is simplest.

Contributor

@nvb nvb left a comment


We discussed this in person and came to a third option. We could scan the lockTable and collect all conflicts while holding latches. During this time, we would perform any necessary breaking of reservations. If we entered any queues during this scan, we would then drop latches and begin waiting on the locks we collected, one-by-one. This fits in nicely with the immutable btree structure and would allow us to maintain the invariant that if a lock is acquired and the lockState previously had a reservation, then the reservation holder is the request acquiring the lock.

We also discussed exploring the idea of keeping requests in queues that they have acquired locks for. This avoids some complexity in acquireLock that doesn't appear to be necessary (but may be).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/concurrency/lock_table_test.go, line 862 at r1 (raw file):

Previously, sumeerbhola wrote…

Presumably these requests when evaluating will (a) discover an intent, (b) if there is no intent make sure to not violate monotonicity of timestamps?
We do not handle this correctly in the lockTable -- the current assumption is that any request that has a SpanReadWrite has a non-nil TxnMeta.

The ordering can be broken down into:

  • does it wait for lock holder: it must
  • does it acquire a reservation: No, since one can construct a deadlock scenario where txn1 holds lock A, req-nil is waiting for it, and at lock B req-nil has the reservation and a different request from txn1 is waiting for it.
  • does it wait in the queue for its turn and does it wait for the reservation holder: We could make it wait on the side and ignore reservation holders, like readers, but the main reason for not doing this for readers is that tracking who they should push became harder (they may conflict with someone ahead of them in the queue and not the reservation holder). That reason does not apply here, so another possibility is to make it queue up, and when it gets to the front it can proceed to evaluation. Additionally making it wait for the reservation holder is messier since it may have a lower sequence number than the reservation holder.

Do you have an opinion? Otherwise I can document the possibilities in a code comment and just do whatever is simplest.

What you said in your comment is exactly the approach I would explore. We'd add them to the queuedWriters list, but we wouldn't let them acquire reservations. The one thing I'm not sure about is your comment about waiting for the reservation holder. Why wouldn't we want it to wait for the reservation holder?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Thanks for the review!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/concurrency/lock_table.go, line 1098 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The invariant that any lock acquisition would be performed by the reservation holder seemed very nice. I wonder if there's any way we could get it back.

I've added a TODO to remove this when switching to the other btree implementation.


pkg/storage/concurrency/lock_table_test.go, line 470 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: we usually use switch statements when matching on enums in Go.

I spent some time debugging after switching over to using switch and then realized that a break inside a switch only breaks out of the switch, not the enclosing loop. I started yearning for C++ and left this unchanged :-)


pkg/storage/concurrency/lock_table_test.go, line 571 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
numConccViolations

numConcViolations?

Done


pkg/storage/concurrency/lock_table_test.go, line 588 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to leave this in?

Removed.


pkg/storage/concurrency/lock_table_test.go, line 589 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'd suggest propagating these errors up to the main testing Goroutine using an errgroup.Group and then calling t.Fatal. Panicking in tests can get messy. It leaks into other tests in the same package and isn't always handled well.

Thanks for the suggestion -- my Go knowledge is poor. Done.


pkg/storage/concurrency/lock_table_test.go, line 600 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: why the pointers?

No reason. Fixed.


pkg/storage/concurrency/lock_table_test.go, line 616 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you give this function a comment?

Done


pkg/storage/concurrency/lock_table_test.go, line 649 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What is the return value here?

Added a comment.


pkg/storage/concurrency/lock_table_test.go, line 662 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a bit more about what strict means?

Done


pkg/storage/concurrency/lock_table_test.go, line 684 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you have any concerns about this flaking when a test machine is under heavy load?

I increased the deadline. I put the deadline here instead of letting the test exceed its deadline since this prints the lock table contents.


pkg/storage/concurrency/lock_table_test.go, line 747 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add here that by not acquiring locks, we never create situations that would cause deadlocks.

Done


pkg/storage/concurrency/lock_table_test.go, line 797 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider making this random between [0, maxStartedTxns) so that we do occasionally drain back down to 0.

Done


pkg/storage/concurrency/lock_table_test.go, line 833 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: you could also just cast a 0 or 1 to spanset.SpanAccess. Or even better, acc := spanset.SpanAccess(rng.Intn(int(spanset.NumSpanAccess)))

Done


pkg/storage/concurrency/lock_table_test.go, line 862 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What you said in your comment is exactly the approach I would explore. We'd add them to the queuedWriters list, but we wouldn't let them acquire reservations. The one thing I'm not sure about is your comment about waiting for the reservation holder. Why wouldn't we want it to wait for the reservation holder?

Added a TODO.

@sumeerbhola sumeerbhola force-pushed the locktable branch 2 times, most recently from 64617bc to 49e3752 on February 7, 2020 20:51
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/concurrency/lock_table_test.go, line 862 at r1 (raw file):

Previously, sumeerbhola wrote…

Added a TODO.

Regarding waiting for the reservation holder there is some subtlety: waiting when the reservation is held by a lower sequence num is ok, but not for a higher sequence num (that would be very confusing). The changes for this are in https://github.com/sumeerbhola/cockroach/blob/ltwrite/pkg/storage/concurrency/lock_table.go -- it turned out to not be hard.

Contributor

@nvb nvb left a comment


:lgtm: other than the one question about removing the guard argument in acquireLock. Let's sequence this before #44787.

Also, while you're here, do you mind removing the comment about the "one rare case" on the lockTableGuardImpl comment?

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/concurrency/lock_table.go, line 1123 at r2 (raw file):

			return nil, errors.Errorf("lockTable bug")
		}
		g.mu.Lock()

Could you add a TODO to explore removing the deletion from g.mu.locks here and below? It still doesn't sound like it's essential, and if not, I think we'd be better off removing it and removing complexity.

Also, even if we can't remove it, is it necessary to have this logic here and below? If we're already iterating through all queuedWriters and looking for matching txnIDs, won't that already handle removing from the acquiring guard's lock set? if so, can we remove the *requestGuardImpl argument now?


pkg/storage/concurrency/lock_table_test.go, line 470 at r1 (raw file):

Previously, sumeerbhola wrote…

I spent some time debugging after switching over to using switch and then realized that break inside a switch only breaks out of the switch even though that break is built into the switch semantics. I started yearning for C++ and left this unchanged :-)

You can break to a specific label in Go: https://www.ardanlabs.com/blog/2013/11/label-breaks-in-go.html. That might help here.

Whichever approach you choose though, I think you should exhaustively test against each stateKind here, even if some unexpected states throw errors. Otherwise, it's hard to know that we're handling all cases properly and not accidentally falling through.

EDIT: looks like you found out about labels 😃
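A minimal illustration of the labeled-break-inside-switch pattern being discussed. The `stateKind` values below are stand-ins for illustration, not the actual lockTable enum:

```go
package main

import "fmt"

type stateKind int

const (
	waiting stateKind = iota
	doneWaiting
	aborted
)

func main() {
	states := []stateKind{waiting, waiting, doneWaiting}
	n := 0
loop: // the label lets break exit the for loop from inside the switch
	for _, s := range states {
		n++
		switch s {
		case waiting:
			// keep waiting for the next state transition
		case doneWaiting:
			break loop // a bare "break" would only exit the switch
		case aborted:
			break loop
		default:
			// exhaustively matching the enum catches unhandled states
			panic(fmt.Sprintf("unexpected state %d", s))
		}
	}
	fmt.Println(n)
}
```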


pkg/storage/concurrency/lock_table_test.go, line 600 at r1 (raw file):

Previously, sumeerbhola wrote…

No reason. Fixed.

👍 you should be able to use node == cycleNode directly now.


pkg/storage/concurrency/lock_table_test.go, line 833 at r1 (raw file):

Previously, sumeerbhola wrote…

Done

Same point as above.


pkg/storage/concurrency/lock_table_test.go, line 454 at r2 (raw file):

		for {
			// Since we can't do a select involving latch acquisition and context
			// cancellation, we make sure to release latches if return early due to

"if return early"

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Let's sequence this before #44787.

Thanks. I'll try to merge tonight. And after #44787 is merged I'll send a PR to add the fix for the non-transactional write case.

Also, while you're here, do you mind removing the comment about the "one rare case" on the lockTableGuardImpl comment?

Done

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/storage/concurrency/lock_table.go, line 1123 at r2 (raw file):

Could you add a TODO to explore removing the deletion from g.mu.locks here and below? It still doesn't sound like it's essential, and if not, I think we'd be better off removing it and removing complexity.

I've added a TODO where locks is declared.

Also, even if we can't remove it, is it necessary to have this logic here and below? If we're already iterating through all queuedWriters and looking for matching txnIDs, won't that already handle removing from the acquiring guard's lock set? if so, can we remove the *requestGuardImpl argument now?

I've changed this to taking a *TxnMeta and hlc.Timestamp argument so your PR should only need to tweak it to change the latter to using TxnMeta.WriteTimestamp -- changing to the latter in this PR itself would be a little inconsistent.
I've also included the fix from #44787 that used beforeTxn, so reconciling the concurrent changes should be limited to changing the function parameters.


pkg/storage/concurrency/lock_table_test.go, line 470 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You can break to a specific label in Go: https://www.ardanlabs.com/blog/2013/11/label-breaks-in-go.html. That might help here.

Whichever approach you choose though, I think you should exhaustively test against each stateKind here, even if some unexpected states throw errors. Otherwise, it's hard to know that we're handling all cases properly and not accidentally falling through.

EDIT: looks like you found out about labels 😃

I've changed to using a switch statement :-)


pkg/storage/concurrency/lock_table_test.go, line 600 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

👍 you should be able to use node == cycleNode directly now.

Done


pkg/storage/concurrency/lock_table_test.go, line 833 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Some point above.

Overlooked in my haste.


pkg/storage/concurrency/lock_table_test.go, line 454 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"if return early"

I've rephrased.

@sumeerbhola
Collaborator Author

bors r+

craig bot pushed a commit that referenced this pull request Feb 9, 2020
44791: storage/concurrency: randomized test for lockTable with concurrency r=sumeerbhola a=sumeerbhola


Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@craig
Contributor

craig bot commented Feb 9, 2020

Build succeeded

@craig craig bot merged commit 924830b into cockroachdb:master Feb 9, 2020
@sumeerbhola sumeerbhola deleted the locktable branch February 10, 2020 14:28
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/storage/concurrency/lock_table.go, line 1361 at r3 (raw file):

	ts := intent.Txn.WriteTimestamp
	txn := &intent.Txn

I was looking at the output of go build -gcflags "-m -m" and this causes intent.Txn to escape to the heap.
I am assuming that the *Intent passed in may not be heap allocated, so it would be desirable to prevent this escape. If yes, should this code keep using intent.Txn until it needs to save it in the holder and then make a copy -- and is the right way to copy TxnMeta to use proto.Merge()?


pkg/storage/concurrency/lock_table.go, line 1884 at r3 (raw file):

			// NB: !resumingInSameSpan
			tree.mu.RLock()
			i := tree.Get(&lockState{key: span.Key})

the tree.Get() calls are causing lockState to escape, which looks like the main cost wrt both cpu and memory allocations when running a trivial benchmark that has no contention. Is this because lockState is being used as an interface in the callee and that causes escape analysis to not realize that this doesn't need heap allocation?

Contributor

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/concurrency/lock_table.go, line 1361 at r3 (raw file):

I am assuming that the *Intent passed in may not be heap allocated, so it would be desirable to prevent this escape.

This should actually be fine. We're going to be passing in a []roachpb.Intent into the concurrency manager (through an interface), so the roachpb.Intent structs will already be on the heap. As long as we iterate over these correctly and pass references directly to the slice elements, we won't be causing any new allocations.


pkg/storage/concurrency/lock_table.go, line 1884 at r3 (raw file):

Previously, sumeerbhola wrote…

the tree.Get() calls are causing lockState to escape, which looks like the main cost wrt both cpu and memory allocations when running a trivial benchmark that has no contention. Is this because lockState is being used as an interface in the callee and that causes escape analysis to not realize that this doesn't need heap allocation?

Yes, and there's no good way to avoid this without stashing the lockState objects on lockTableImpl, but that won't work because you're only holding a shared lock. The general rule in Go is: "if you're using interfaces, you're working with heap memory and causing heap allocations". This is one of the many things that will improve when we switch to the generated btree code.
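The interface-driven heap escape described here can be observed directly with `testing.AllocsPerRun`. This standalone sketch uses hypothetical stand-ins (`item` for `lockState`, `get` for `tree.Get`); the probe key is heap-allocated solely because it passes through an interface:

```go
package main

import (
	"fmt"
	"testing"
)

type item struct{ key int }

// get takes its argument through an interface, like a btree Get. The
// noinline directive stops the compiler from seeing through the call,
// so escape analysis must assume the argument escapes.
//
//go:noinline
func get(k interface{}) bool {
	_, ok := k.(*item)
	return ok
}

func main() {
	allocs := testing.AllocsPerRun(100, func() {
		// The composite literal escapes to the heap: it is passed
		// through an interface to a function that may retain it.
		get(&item{key: 42})
	})
	// At least one allocation per call, matching the -gcflags "-m -m"
	// diagnosis in the comment above.
	fmt.Println(allocs >= 1)
}
```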
