Add `loopvarcapture` linter by renatolabs · Pull Request #82324 · cockroachdb/cockroach

renatolabs · 2022-06-01T23:40:29Z

This PR introduces a new linter: loopvarcapture. It reports
uses of loop variables captured by reference in Go routines or defer
statements, a common source of data races [1].

govet currently has a similar linter [2]; however, that project
prioritizes having no false positives at the expense of allowing
false negatives. This linter, on the other hand, represents the
opinion that loop variables should not be captured by reference in Go
routines even when it's safe to do so. That behavior is confusing and
concurrency added to related code over time could lead to the
introduction of data races, potentially manifesting as bugs in the
product or flakiness in the tests. These issues are hard to debug and
take a lot of developer time.

For instance, the following code will be identified by the linter:

for i, obj := range collection {
    go func() {
        // ...
        myFunc(i) // variable 'i' captured by reference
        // ...
    }()

    showProgress := func() {
        fmt.Printf("finished processing: %#v\n", obj)
    }

    go func() {
        defer showProgress() // function 'showProgress' captures loop variable 'obj' by reference
        // ...
    }()
}

To fix such cases, the code is expected to not use the loop variable in the
closure, as in the examples below:

for i, obj := range collection {
    i := i // create copy of index variable
    go func() {
        // ...
        myFunc(i) // this is OK -- using loop's copy of the index variable
        // ...
    }()

    showProgress := func(obj MyStruct) {
        fmt.Printf("finished processing: %#v\n", obj)
    }

    go func(obj MyStruct) {
        defer showProgress(obj) // this is OK -- passing loop variable as parameter to the closure
        // ...
    }(obj)
}

Developers are still able to use their own judgement and disable this
linter in specific instances by using a nolint comment.

The commits after the first one fix violations found by this new linter in multiple
packages; all of the violations happened in test files. In most cases, the fixes
did not actually solve data races because most of our unit tests run
sequentially. There was a minor roachtest bug fix related to progress
logging.

Apologies for updating this many packages in one PR; the majority of the
changes were mechanical and don't require much review by the different teams.
Reviewing of the first commit should be mostly done by @cockroachdb/test-eng.

[1] A Study of Real-World Data Races in Golang: https://arxiv.org/pdf/2204.00764.pdf
[2] https://github.com/golangci/govet/blob/44ddbe260190d79165f4150b828650780405d801/rangeloop.go#L36

cockroach-teamcity · 2022-06-01T23:40:44Z

This change is

ajwerner · 2022-06-02T00:30:27Z

Take or leave this feedback: I wish this thing was just a tad more sophisticated: I wish it would accept it in the case where the variable is never written to again. In practice that's a tad complicated. I guess you'd want to see if the variable were captured in a different closure and to see if its address were taken and then generally looking for assignments. In the vast majority of the changes you have to make here, the variables are never changed after their initial assignment.

All that being said, nice job writing a linter. The analysis framework is cool. @cockroachdb/sql-schema code LGTM

andreimatei · 2022-06-02T02:32:12Z

This linter, on the other hand, represents the
opinion that loop variables should not be captured by reference in Go
routines even when it's safe to do so.

This linter's opinion does not correspond to mine :).

That behavior is confusing and
concurrency added to related code over time could lead to the
introduction of data races, potentially manifesting as bugs in the
product or flakiness in the tests. These issues are hard to debug and
take a lot of developer time.

I checked a couple of the cases touched by this PR, and in all of them adding some sort of new concurrency would cause all sorts of problems unrelated to the loop variable. It's very common to have iterations of a loop that use goroutines internally, and wait for all these goroutines to finish before moving to the next iteration. In fact, I think you'll find this is the common case. I don't want this linter in the way of this pattern.

So I for one think I'm against this linter. Tell me the numbers again: how many bugs were actually fixed here, versus how many false positives?

We already have too many linters; let's think twice before adding more.

If this does go forward after all, at the very least let's fix the "violations" by making a copy of the variable in question inside the loop and have the closure capture the copy (or by using the index-only range iteration), rather than introducing parameters to the closures.

Also, reorder the new linter commit at the end, so that the tests always pass (for bisections for example).

rickystewart

Drive-by: add the package to the existing list of lint checks in the top-level BUILD.bazel file. No opinion expressed on the linter itself.

renatolabs · 2022-06-03T14:13:47Z

Take or leave this feedback: I wish this thing was just a tad more sophisticated: I wish it would accept it in the case where the variable is never written to again.

@ajwerner Not sure I understand your comment here. Why would it matter whether the variable is written to again? This linter only applies to loop variables (which are "assigned" in each iteration of the loop). As long as there's a chance that the Go routine will outlive the iteration where it was created, there's a data race.

Let me know if I'm missing the point you were making.

renatolabs · 2022-06-03T14:32:18Z

@andreimatei thank you for your feedback 👍 My opinion is that loop variables being captured by reference in Go routines increase the chances of data races even if it's safe to do so at the time they were introduced. At the same time, I also understand your point that the pattern of creating a Go routine and waiting for it to finish in the same iteration is a common pattern, and I didn't want this linter to get too much in the way of that.

In fact, I think you'll find this is the common case. I don't want this linter in the way of this pattern.

This is indeed the common case as I mentioned in the commit messages and PR description. The majority (if not all) of the changes here don't fix data races that exist at this point in time either because the tests run sequentially or because there's extra synchronization involved.

I wanted to point out that I saw this linter not as a guaranteed data-race detector, but as an indicator of a potentially dangerous pattern. IMO (and as substantiated by the paper linked in the PR description), this behaviour of closures capturing variables in scope by reference is not obvious to most people and lead to subtle bugs. This can happen even when it looks like we're waiting for the Go routines to finish at the end of the loop, like below:

for _, tc := range array {
  // do things
  wg.Add(len(tc.queries))
  for _, query := range tc.queries {
    go func() {
      defer wg.Done()
      doSomethingWithQuery(query)
    }()
  }

  wg.Wait()
  // etc
}

Maybe the issue is obvious when looking at the code above because we have our minds set to look for that pattern and beause the code is not complex enough; in more realistic scenarios, I think it's easy to miss the data race there.

All of that said, I'm open to hearing everyone's opinions and if most people think it's better to not have the linter, I'm fine with that as well. 👍

Your comment also gave me an idea to change the linter so that it will recognize "obvious" synchronization happening in the loop. The linter could not report issues if one of the following is true:

the closure sends to a channel that is then read after in the same loop
the closure calls Done on a wait group that is then Waited in the same loop.

I prototyped this change, and the vast majority of test cases changed here would not need to be changed if we implemented it like that. The linter will not be able to detect more sophisticated synchronization mechanisms, but I think this gets us a long way.

Happy to hear thoughts and opinions on this approach or on anything else here. 🙂

erikgrinaker · 2022-06-03T14:37:06Z

I'm +1 on this. Capturing loop variables is a frequent and subtle footgun, especially for newcomers to Go. The upside here definitely outweighs the downside.

srosenberg

Reviewed 4 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 61 at r2 (raw file):

			pass.Report(issue)
		}
		return false

What about nested loops?

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 75 at r2 (raw file):

	pass *analysis.Pass

	// closures keeps a map from local closures that capture loop

Nit: maps a closure to the captured-by-reference loop variable.

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 105 at r2 (raw file):

// 'closures' map is updated.
//
// When a `go` statement is found, we look for function literals

May want to tighten this a bit by emphasizing that only the body of the closure is checked; specifically, the actuals passed into the closure formals are not checked; i.e., the following is ok because the closure has no "free" vars.

for k, v := range myMap {
      go func(k, v) {
             fmt.Printf("k = %v, v = %v\n", k, v)
      })(k, v)
}

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 146 at r2 (raw file):

			}

			ast.Inspect(rhs, v.funcLitInspector(func(id *ast.Ident) {

Might be worth adding above ast.Inspect(...),

// Inspect closure's body, looking for captured variables; if found, store the mapping below.

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 173 at r2 (raw file):

}

// findVariableCaptures returns a function that can be passed to

Nit,

// findVariableCaptures inspects closure's body. When a reference to a loop variable is found...

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 178 at r2 (raw file):

// reference is called, the `onInvalidReference` function passed is
// called.
func (v *Visitor) findVariableCaptures(onInvalidReference func(*ast.Ident)) func(ast.Node) bool {

I wonder if it's more precise to call it "findFreeVarsInClosure" since "free" and "bound" variables have formal definitions [1] whereas "captured" is less formally defined :)

[1] https://en.wikipedia.org/wiki/Free_variables_and_bound_variables

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 182 at r2 (raw file):

		switch expr := n.(type) {
		case *ast.Ident:
			if expr.Obj == nil {

Curious, when is it nil?

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 187 at r2 (raw file):

			for _, loopVar := range v.loop.Vars {
				if expr.Obj == loopVar.Obj {

This implies that the loop variable occurs freely inside the closure's body by virtue of having the same Obj; might be worth a comment since Obj is not super well-defined in ast.go.

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 194 at r2 (raw file):

			return false

		case *ast.CallExpr:

I wonder if it's better to report the issue at the declaration site? We know that the function declaration has to be used, otherwise the compiler will raise an error. However, if it's a false-positive, and the function is used frequently, then the comment to suppress it must be copied several times.

pkg/testutils/lint/passes/goroutinerefcapture/testdata/src/p/p.go line 208 at r2 (raw file):

			//nolint:goroutinerefcapture
			badClosure()

I wonder if it's better to report the issue inside badClosure's body, i.e., at the declaration site. This way, if it's a false-positive, only one comment is needed to suppress the linter vs. k for every kth invocation.

andreimatei

I wanted to point out that I saw this linter not as a guaranteed data-race detector, but as an indicator of a potentially dangerous pattern. IMO (and as substantiated by the paper linked in the PR description), this behaviour of closures capturing variables in scope by reference is not obvious to most people and lead to subtle bug

From the paper:
We have deployed this detector in Uber’s 46 million lines of
Go codebase hosting 2100 distinct microservices, found over
2000 data races, and fixed over 1000 data races, spanning 790
distinct code patches submitted by 210 unique developers
over a six-month period. Based on a detailed investigation
of these data race patterns in Go, we make seven high-level
observations relating to the complex interplay between the
Go language paradigm and data races.

So their study is on races that are detected by standard tooling. We use the respective race detector. And it's being upgraded to tsan v3 in go1.19, which is supposed to make it significantly better still.
I personally remain unconvinced that the benefits here outweigh the risks. Did Uber itself implement something like this?

I prototyped this change, and the vast majority of test cases changed here would not need to be changed if we implemented it like that. The linter will not be able to detect more sophisticated synchronization mechanisms, but I think this gets us a long way.

Consider also ctxgroup.Group and errorgroup.Group as popular synchronization mechanisms.
But, I personally would much prefer the dumber govet test. Do we already run that one?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)

srosenberg

I wonder if still need to run golang.org/x/tools/go/analysis/passes/loopclosure? Technically, this linter subsumes the inspection of go, but it leaves defer open.

Reviewed 2 of 6 files at r1, 4 of 4 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 10 of 10 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)

srosenberg

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 139 at r8 (raw file):

		}

	case *ast.AssignStmt:

Looks like it might be missing IncDecStmt as done in loopclosure [1].

[1] https://cs.opensource.google/go/x/tools/+/refs/tags/v0.1.10:go/analysis/passes/loopclosure/loopclosure.go;l=73

stevendanna

It's interesting that most of the violations it found here are in test code.

Regarding the overall usefulness of this linter, one of my main questions would be how long this takes to run on some of the larger packages. I imagine it is rather fast.

I'm OK turning this linter on. If we can't get to agreement, perhaps we could still commit this and make it easy to run occasionally (some of the features of the nilness linter would also benefit from this)

stevendanna · 2022-06-07T08:46:40Z

+// loop variable, that is also reported.
+func (v *Visitor) visitLoopBody(n ast.Node) bool {
+	switch stmt := n.(type) {
+	case *ast.GoStmt:


Should we check *ast.DeferStmt as well?

This commit removes references to loop variables in Go routines in `ccl`. This is in preparation for a linter that will detect such references, in an attempt to reduce the chances of introducing data races. Even though the captures fixed here are safe (since there was synchronization involved), we want to avoid implicitly capturing loop variables in Go routines whenever possible. Release note: None.

This is in preparation for a linter that will detect such references. Release note: None.

Release note: None.

Even though this use was safe (due to the use of the WaitGroup), capturing loop variables by reference in Go routines should not be encouraged. Release note: None.

Release note: None.

This commit introduces a new linter: `loopvarcapture`. It reports uses of loop variables captured by reference in Go routines or defer statements, a common source of data races [1]. `govet` currently has a similar linter [2]; however, that project prioritizes having no false positives at the expense of allowing false negatives. This linter, on the other hand, represents the opinion that loop variables should not be captured by reference in Go routines even when it's safe to do so. That behavior is confusing and concurrency added to related code over time could lead to the introduction of data races, potentially manifesting as bugs in the product or flakiness in the tests. These issues are hard to debug and take a lot of developer time. Developers are still able to use their own judgement and disable this linter in specific instances by using a `nolint` comment. [1] A Study of Real-World Data Races in Golang: https://arxiv.org/pdf/2204.00764.pdf [2] https://github.com/golangci/govet/blob/44ddbe260190d79165f4150b828650780405d801/rangeloop.go#L36 Resolves: cockroachdb#80803. Release note: None.

craig · 2022-06-23T01:39:15Z

Canceled.

renatolabs · 2022-06-23T14:28:42Z

bors r=srosenberg

craig · 2022-06-23T15:56:12Z

Build succeeded:

GitHub CI (Cockroach)

andreimatei · 2022-06-23T23:22:23Z

This seems to have triggered a false positive on a random PR in bors:
https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/5548391?showRootCauses=true&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

pkg/server/api_v2_sql.go:411:59: loop variable 'stmtIdx' captured by reference, and may hold an undesirable value by the time the deferred function is called

The PR in question is:
https://reviewable.io/reviews/cockroachdb/cockroach/79663#-N0XWFS5CidRv8Q3C4t1

Line 411 is retErr = errors.Wrapf(retErr, "executing stmt %d", stmtIdx+1).

I think there's no goroutines at play there, just two nested anonymous functions. As far as I can see, at least.

srosenberg · 2022-06-23T23:39:38Z

The PR in question is: https://reviewable.io/reviews/cockroachdb/cockroach/79663#-N0XWFS5CidRv8Q3C4t1

Line 411 is retErr = errors.Wrapf(retErr, "executing stmt %d", stmtIdx+1).

I think there's no goroutines at play there, just two nested anonymous functions. As far as I can see, at least.

It's captured inside the defer statement which is subject to the same rules as goroutines. If this is benign, the author can suppress it via,

//nolint:loopvarcapture

andreimatei · 2022-06-24T01:37:20Z

But look at the code. The defer does not outlive the loop iteration; it's inside a function called inline.

srosenberg · 2022-06-24T02:29:34Z

But look at the code. The defer does not outlive the loop iteration; it's inside a function called inline.

That's true; this defer is always executed sequentially. @renatolabs Thoughts on whether it's easy to allow these cases? i.e., exclude defer when it's declared inside an applied closure?

Personally, I would refactor to this form,

stmtErr := func(int i) (retErr error) {
   ...
   txnRes.Statement = i
   ...
   defer func() {
        ...
        retErr = errors.Wrapf(retErr, "executing stmt %d", i)
        ...
   }()
   ...
}(stmtIdx+1)

Not only does it improve readability by removing the common subexpression (stmtIdx+1), its (operational) semantics are much more lucid, imo. It's not immediately obvious that the deeply nested defer is executed on each iteration; maybe it's just me, but I had to stare at the code for a good 30+ seconds to prove that it's free of unwanted side-effects.

renatolabs · 2022-06-24T13:05:53Z

I'm aware that the linter's treatment of defer can be improved in cases where it's contained in a closure within the loop; this was on my list. Thanks for reporting, I'll take a look at it soon.

andreimatei · 2022-06-24T15:46:51Z

Personally, I would refactor to this form,

This rewrite is worse, in my opinion. You're introducing an argument (i) for which you have to scroll down two screens to see what it's assigned to. That's part of the reason why I was against this linter - it encourages doing this, which can be harmful to readability.

@renatolabs given that this false-positive was triggered by, I believe, literally the first bors batch after merging the linter, can I convince you to revert it until you fix it (if the fix is not imminent)? The use of anonymous functions called inline for the explicit purpose of putting a defer in them is a useful and common Go idiom, and I think we definitely don't want to flag it.

knz · 2022-06-24T15:59:25Z

The use of anonymous functions called inline for the explicit purpose of putting a defer in them is a useful and common Go idiom

Yep, that's true. For example, I just merged PR #83083 that added a dozen of those.

srosenberg · 2022-06-24T16:41:29Z

The use of anonymous functions called inline for the explicit purpose of putting a defer in them is a useful and common Go idiom

Yep, that's true. For example, I just merged PR #83083 that added a dozen of those.

This is a good example from that PR,

// Close releases the resources used by the IndexBackfiller.
func (ib *IndexBackfiller) Close(ctx context.Context) {
	if ib.mon != nil {
		func() {
			ib.muBoundAccount.Lock()
			defer ib.muBoundAccount.Unlock()
			ib.muBoundAccount.boundAccount.Close(ctx)
		}()
		ib.mon.Stop(ctx)
	}
}

renatolabs · 2022-06-27T15:27:47Z

There's a PR up to handle the situation discussed here: #83418.

andreimatei · 2022-06-27T20:06:57Z

Thank you!

renatolabs requested review from a team as code owners June 1, 2022 23:40

renatolabs requested review from a team June 1, 2022 23:40

renatolabs requested a review from a team as a code owner June 1, 2022 23:40

renatolabs requested review from dt and removed request for a team June 1, 2022 23:40

renatolabs force-pushed the 80803-capture-by-ref-linter branch from a5bd265 to 2390a7a Compare June 2, 2022 00:01

tbg requested a review from srosenberg June 2, 2022 11:22

rickystewart reviewed Jun 2, 2022

View reviewed changes

srosenberg reviewed Jun 6, 2022

View reviewed changes

andreimatei reviewed Jun 6, 2022

View reviewed changes

srosenberg reviewed Jun 6, 2022

View reviewed changes

stevendanna reviewed Jun 7, 2022

View reviewed changes

renatolabs force-pushed the 80803-capture-by-ref-linter branch from 2390a7a to b9f5d8b Compare June 9, 2022 14:18

renatolabs requested review from a team as code owners June 9, 2022 14:18

renatolabs requested review from gh-casper and removed request for a team June 9, 2022 14:19

Renato Costa added 10 commits June 22, 2022 21:36

roachtest: remove reference to loop variable in Go routine.

b1c902f

This is in preparation for a linter that will detect such references. Release note: None.

kv: remove references to loop variables in Go routines.

8597216

Release note: None.

rpc: don't capture loop variable in Go routine in test.

9716234

Even though this use was safe (due to the use of the WaitGroup), capturing loop variables by reference in Go routines should not be encouraged. Release note: None.

server: don't capture loop variable by reference in test.

0daa1b7

Release note: None.

sql: remove capture by reference of loop variables in test Go routines.

e428136

Release note: None.

bench: remove loop variable captures from defer statement.

81bbfb0

Release note: None.

storage: do not capture loop variables in defer statement.

ec86b78

Release note: None.

workload: do not reference loop variables in Go routine.

b0bb1e2

Release note: None.

renatolabs force-pushed the 80803-capture-by-ref-linter branch from b61dffa to b3d21b2 Compare June 23, 2022 01:39

renatolabs requested review from a team and rhu713 June 23, 2022 01:39

craig Bot merged commit 9e47dc9 into cockroachdb:master Jun 23, 2022

renatolabs deleted the 80803-capture-by-ref-linter branch June 23, 2022 16:23

andreimatei mentioned this pull request Jun 23, 2022

server: new HTTP API to execute SQL statements #79663

Merged

renatolabs mentioned this pull request Aug 25, 2022

linter: add synchronization detection to loopvarcapture #86836

Closed

Conversation

renatolabs commented Jun 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jun 1, 2022

Uh oh!

ajwerner commented Jun 2, 2022

Uh oh!

andreimatei commented Jun 2, 2022

Uh oh!

rickystewart left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Jun 3, 2022

Uh oh!

renatolabs commented Jun 3, 2022

Uh oh!

erikgrinaker commented Jun 3, 2022

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

stevendanna left a comment

Choose a reason for hiding this comment

Uh oh!

stevendanna Jun 7, 2022

Choose a reason for hiding this comment

Uh oh!

craig Bot commented Jun 23, 2022

Uh oh!

renatolabs commented Jun 23, 2022

Uh oh!

craig Bot commented Jun 23, 2022

Uh oh!

andreimatei commented Jun 23, 2022

Uh oh!

srosenberg commented Jun 23, 2022

Uh oh!

andreimatei commented Jun 24, 2022

Uh oh!

srosenberg commented Jun 24, 2022

Uh oh!

renatolabs commented Jun 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andreimatei commented Jun 24, 2022

Uh oh!

knz commented Jun 24, 2022

Uh oh!

srosenberg commented Jun 24, 2022

Uh oh!

renatolabs commented Jun 27, 2022

Uh oh!

andreimatei commented Jun 27, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants

renatolabs commented Jun 1, 2022 •

edited

Loading

renatolabs commented Jun 24, 2022 •

edited

Loading