Skip to content

Add loopvarcapture linter#82324

Merged
craig[bot] merged 10 commits intocockroachdb:masterfrom
renatolabs:80803-capture-by-ref-linter
Jun 23, 2022
Merged

Add loopvarcapture linter#82324
craig[bot] merged 10 commits intocockroachdb:masterfrom
renatolabs:80803-capture-by-ref-linter

Conversation

@renatolabs
Copy link
Copy Markdown

@renatolabs renatolabs commented Jun 1, 2022

This PR introduces a new linter: loopvarcapture. It reports
uses of loop variables captured by reference in Go routines or defer
statements, a common source of data races [1].

govet currently has a similar linter [2]; however, that project
prioritizes having no false positives at the expense of allowing
false negatives. This linter, on the other hand, represents the
opinion that loop variables should not be captured by reference in Go
routines even when it's safe to do so. That behavior is confusing and
concurrency added to related code over time could lead to the
introduction of data races, potentially manifesting as bugs in the
product or flakiness in the tests. These issues are hard to debug and
take a lot of developer time.

For instance, the following code will be identified by the linter:

for i, obj := range collection {
    go func() {
        // ...
        myFunc(i) // variable 'i' captured by reference
        // ...
    }()

    showProgress := func() {
        fmt.Printf("finished processing: %#v\n", obj)
    }

    go func() {
        defer showProgress() // function 'showProgress' captures loop variable 'obj' by reference
        // ...
    }()
}

To fix such cases, the code is expected to not use the loop variable in the
closure, as in the examples below:

for i, obj := range collection {
    i := i // create copy of index variable
    go func() {
        // ...
        myFunc(i) // this is OK -- using loop's copy of the index variable
        // ...
    }()

    showProgress := func(obj MyStruct) {
        fmt.Printf("finished processing: %#v\n", obj)
    }

    go func(obj MyStruct) {
        defer showProgress(obj) // this is OK -- passing loop variable as parameter to the closure
        // ...
    }(obj)
}

Developers are still able to use their own judgement and disable this
linter in specific instances by using a nolint comment.

The commits after the first one fix violations found by this new linter in multiple
packages; all of the violations happened in test files. In most cases, the fixes
did not actually solve data races because most of our unit tests run
sequentially. There was a minor roachtest bug fix related to progress
logging.

Apologies for updating this many packages in one PR; the majority of the
changes were mechanical and don't require much review by the different teams.
Reviewing of the first commit should be mostly done by @cockroachdb/test-eng.

[1] A Study of Real-World Data Races in Golang: https://arxiv.org/pdf/2204.00764.pdf
[2] https://github.com/golangci/govet/blob/44ddbe260190d79165f4150b828650780405d801/rangeloop.go#L36

@renatolabs renatolabs requested review from a team as code owners June 1, 2022 23:40
@renatolabs renatolabs requested review from a team June 1, 2022 23:40
@renatolabs renatolabs requested a review from a team as a code owner June 1, 2022 23:40
@renatolabs renatolabs requested review from dt and removed request for a team June 1, 2022 23:40
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@renatolabs renatolabs force-pushed the 80803-capture-by-ref-linter branch from a5bd265 to 2390a7a Compare June 2, 2022 00:01
@ajwerner
Copy link
Copy Markdown
Contributor

ajwerner commented Jun 2, 2022

Take or leave this feedback: I wish this thing was just a tad more sophisticated: I wish it would accept it in the case where the variable is never written to again. In practice that's a tad complicated. I guess you'd want to see if the variable were captured in a different closure and to see if its address were taken and then generally looking for assignments. In the vast majority of the changes you have to make here, the variables are never changed after their initial assignment.

All that being said, nice job writing a linter. The analysis framework is cool. @cockroachdb/sql-schema code LGTM

@andreimatei
Copy link
Copy Markdown
Contributor

This linter, on the other hand, represents the
opinion that loop variables should not be captured by reference in Go
routines even when it's safe to do so.

This linter's opinion does not correspond to mine :).

That behavior is confusing and
concurrency added to related code over time could lead to the
introduction of data races, potentially manifesting as bugs in the
product or flakiness in the tests. These issues are hard to debug and
take a lot of developer time.

I checked a couple of the cases touched by this PR, and in all of them adding some sort of new concurrency would cause all sorts of problems unrelated to the loop variable. It's very common to have iterations of a loop that use goroutines internally, and wait for all these goroutines to finish before moving to the next iteration. In fact, I think you'll find this is the common case. I don't want this linter in the way of this pattern.

So I for one think I'm against this linter. Tell me the numbers again: how many bugs were actually fixed here, versus how many false positives?

We already have too many linters; let's think twice before adding more.


If this does go forward after all, at the very least let's fix the "violations" by making a copy of the variable in question inside the loop and have the closure capture the copy (or by using the index-only range iteration), rather than introducing parameters to the closures.

Also, reorder the new linter commit at the end, so that the tests always pass (for bisections for example).

@tbg tbg requested a review from srosenberg June 2, 2022 11:22
Copy link
Copy Markdown
Collaborator

@rickystewart rickystewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by: add the package to the existing list of lint checks in the top-level BUILD.bazel file. No opinion expressed on the linter itself.

@renatolabs
Copy link
Copy Markdown
Author

Take or leave this feedback: I wish this thing was just a tad more sophisticated: I wish it would accept it in the case where the variable is never written to again.

@ajwerner Not sure I understand your comment here. Why would it matter whether the variable is written to again? This linter only applies to loop variables (which are "assigned" in each iteration of the loop). As long as there's a chance that the Go routine will outlive the iteration where it was created, there's a data race.

Let me know if I'm missing the point you were making.

@renatolabs
Copy link
Copy Markdown
Author

@andreimatei thank you for your feedback 👍 My opinion is that loop variables being captured by reference in Go routines increase the chances of data races even if it's safe to do so at the time they were introduced. At the same time, I also understand your point that the pattern of creating a Go routine and waiting for it to finish in the same iteration is a common pattern, and I didn't want this linter to get too much in the way of that.

In fact, I think you'll find this is the common case. I don't want this linter in the way of this pattern.

This is indeed the common case as I mentioned in the commit messages and PR description. The majority (if not all) of the changes here don't fix data races that exist at this point in time either because the tests run sequentially or because there's extra synchronization involved.

I wanted to point out that I saw this linter not as a guaranteed data-race detector, but as an indicator of a potentially dangerous pattern. IMO (and as substantiated by the paper linked in the PR description), this behaviour of closures capturing variables in scope by reference is not obvious to most people and lead to subtle bugs. This can happen even when it looks like we're waiting for the Go routines to finish at the end of the loop, like below:

for _, tc := range array {
  // do things
  wg.Add(len(tc.queries))
  for _, query := range tc.queries {
    go func() {
      defer wg.Done()
      doSomethingWithQuery(query)
    }()
  }

  wg.Wait()
  // etc
}

Maybe the issue is obvious when looking at the code above because we have our minds set to look for that pattern and beause the code is not complex enough; in more realistic scenarios, I think it's easy to miss the data race there.


All of that said, I'm open to hearing everyone's opinions and if most people think it's better to not have the linter, I'm fine with that as well. 👍

Your comment also gave me an idea to change the linter so that it will recognize "obvious" synchronization happening in the loop. The linter could not report issues if one of the following is true:

  • the closure sends to a channel that is then read after in the same loop
  • the closure calls Done on a wait group that is then Waited in the same loop.

I prototyped this change, and the vast majority of test cases changed here would not need to be changed if we implemented it like that. The linter will not be able to detect more sophisticated synchronization mechanisms, but I think this gets us a long way.

Happy to hear thoughts and opinions on this approach or on anything else here. 🙂

@erikgrinaker
Copy link
Copy Markdown
Contributor

I'm +1 on this. Capturing loop variables is a frequent and subtle footgun, especially for newcomers to Go. The upside here definitely outweighs the downside.

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 61 at r2 (raw file):

			pass.Report(issue)
		}
		return false

What about nested loops?


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 75 at r2 (raw file):

	pass *analysis.Pass

	// closures keeps a map from local closures that capture loop

Nit: maps a closure to the captured-by-reference loop variable.


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 105 at r2 (raw file):

// 'closures' map is updated.
//
// When a `go` statement is found, we look for function literals

May want to tighten this a bit by emphasizing that only the body of the closure is checked; specifically, the actuals passed into the closure formals are not checked; i.e., the following is ok because the closure has no "free" vars.

for k, v := range myMap {
      go func(k, v) {
             fmt.Printf("k = %v, v = %v\n", k, v)
      })(k, v)
}

pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 146 at r2 (raw file):

			}

			ast.Inspect(rhs, v.funcLitInspector(func(id *ast.Ident) {

Might be worth adding above ast.Inspect(...),

// Inspect closure's body, looking for captured variables; if found, store the mapping below.


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 173 at r2 (raw file):

}

// findVariableCaptures returns a function that can be passed to

Nit,

// findVariableCaptures inspects closure's body. When a reference to a loop variable is found...


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 178 at r2 (raw file):

// reference is called, the `onInvalidReference` function passed is
// called.
func (v *Visitor) findVariableCaptures(onInvalidReference func(*ast.Ident)) func(ast.Node) bool {

I wonder if it's more precise to call it "findFreeVarsInClosure" since "free" and "bound" variables have formal definitions [1] whereas "captured" is less formally defined :)

[1] https://en.wikipedia.org/wiki/Free_variables_and_bound_variables


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 182 at r2 (raw file):

		switch expr := n.(type) {
		case *ast.Ident:
			if expr.Obj == nil {

Curious, when is it nil?


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 187 at r2 (raw file):

			for _, loopVar := range v.loop.Vars {
				if expr.Obj == loopVar.Obj {

This implies that the loop variable occurs freely inside the closure's body by virtue of having the same Obj; might be worth a comment since Obj is not super well-defined in ast.go.


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 194 at r2 (raw file):

			return false

		case *ast.CallExpr:

I wonder if it's better to report the issue at the declaration site? We know that the function declaration has to be used, otherwise the compiler will raise an error. However, if it's a false-positive, and the function is used frequently, then the comment to suppress it must be copied several times.


pkg/testutils/lint/passes/goroutinerefcapture/testdata/src/p/p.go line 208 at r2 (raw file):

			//nolint:goroutinerefcapture
			badClosure()

I wonder if it's better to report the issue inside badClosure's body, i.e., at the declaration site. This way, if it's a false-positive, only one comment is needed to suppress the linter vs. k for every kth invocation.

Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to point out that I saw this linter not as a guaranteed data-race detector, but as an indicator of a potentially dangerous pattern. IMO (and as substantiated by the paper linked in the PR description), this behaviour of closures capturing variables in scope by reference is not obvious to most people and lead to subtle bug

From the paper:
We have deployed this detector in Uber’s 46 million lines of
Go codebase hosting 2100 distinct microservices, found over
2000 data races, and fixed over 1000 data races, spanning 790
distinct code patches submitted by 210 unique developers
over a six-month period. Based on a detailed investigation
of these data race patterns in Go, we make seven high-level
observations relating to the complex interplay between the
Go language paradigm and data races.

So their study is on races that are detected by standard tooling. We use the respective race detector. And it's being upgraded to tsan v3 in go1.19, which is supposed to make it significantly better still.
I personally remain unconvinced that the benefits here outweigh the risks. Did Uber itself implement something like this?

I prototyped this change, and the vast majority of test cases changed here would not need to be changed if we implemented it like that. The linter will not be able to detect more sophisticated synchronization mechanisms, but I think this gets us a long way.

Consider also ctxgroup.Group and errorgroup.Group as popular synchronization mechanisms.
But, I personally would much prefer the dumber govet test. Do we already run that one?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if still need to run golang.org/x/tools/go/analysis/passes/loopclosure? Technically, this linter subsumes the inspection of go, but it leaves defer open.

Reviewed 2 of 6 files at r1, 4 of 4 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 10 of 10 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @renatolabs)


pkg/testutils/lint/passes/goroutinerefcapture/goroutinerefcapture.go line 139 at r8 (raw file):

		}

	case *ast.AssignStmt:

Looks like it might be missing IncDecStmt as done in loopclosure [1].

[1] https://cs.opensource.google/go/x/tools/+/refs/tags/v0.1.10:go/analysis/passes/loopclosure/loopclosure.go;l=73

Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting that most of the violations it found here are in test code.

Regarding the overall usefulness of this linter, one of my main questions would be how long this takes to run on some of the larger packages. I imagine it is rather fast.

I'm OK turning this linter on. If we can't get to agreement, perhaps we could still commit this and make it easy to run occasionally (some of the features of the nilness linter would also benefit from this)

// loop variable, that is also reported.
func (v *Visitor) visitLoopBody(n ast.Node) bool {
switch stmt := n.(type) {
case *ast.GoStmt:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check *ast.DeferStmt as well?

@renatolabs renatolabs force-pushed the 80803-capture-by-ref-linter branch from 2390a7a to b9f5d8b Compare June 9, 2022 14:18
@renatolabs renatolabs requested review from a team as code owners June 9, 2022 14:18
@renatolabs renatolabs requested review from gh-casper and removed request for a team June 9, 2022 14:19
Renato Costa added 10 commits June 22, 2022 21:36
This commit removes references to loop variables in Go routines in
`ccl`. This is in preparation for a linter that will detect such
references, in an attempt to reduce the chances of introducing data
races.

Even though the captures fixed here are safe (since there was
synchronization involved), we want to avoid implicitly capturing loop
variables in Go routines whenever possible.

Release note: None.
This is in preparation for a linter that will detect such references.

Release note: None.
Even though this use was safe (due to the use of the WaitGroup),
capturing loop variables by reference in Go routines should not be
encouraged.

Release note: None.
This commit introduces a new linter: `loopvarcapture`. It reports
uses of loop variables captured by reference in Go routines or defer
statements, a common source of data races [1].

`govet` currently has a similar linter [2]; however, that project
prioritizes having no false positives at the expense of allowing
false negatives. This linter, on the other hand, represents the
opinion that loop variables should not be captured by reference in Go
routines even when it's safe to do so. That behavior is confusing and
concurrency added to related code over time  could lead to the
introduction of data races, potentially manifesting as bugs in the
product or flakiness in the tests. These issues are hard to debug and
take a lot of developer time.

Developers are still able to use their own judgement and disable this
linter in specific instances by using a `nolint` comment.

[1] A Study of Real-World Data Races in Golang: https://arxiv.org/pdf/2204.00764.pdf
[2] https://github.com/golangci/govet/blob/44ddbe260190d79165f4150b828650780405d801/rangeloop.go#L36

Resolves: cockroachdb#80803.

Release note: None.
@renatolabs renatolabs force-pushed the 80803-capture-by-ref-linter branch from b61dffa to b3d21b2 Compare June 23, 2022 01:39
@renatolabs renatolabs requested review from a team and rhu713 June 23, 2022 01:39
@craig
Copy link
Copy Markdown
Contributor

craig Bot commented Jun 23, 2022

Canceled.

@renatolabs
Copy link
Copy Markdown
Author

bors r=srosenberg

@craig
Copy link
Copy Markdown
Contributor

craig Bot commented Jun 23, 2022

Build succeeded:

@craig craig Bot merged commit 9e47dc9 into cockroachdb:master Jun 23, 2022
@renatolabs renatolabs deleted the 80803-capture-by-ref-linter branch June 23, 2022 16:23
@andreimatei
Copy link
Copy Markdown
Contributor

This seems to have triggered a false positive on a random PR in bors:
https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/5548391?showRootCauses=true&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

pkg/server/api_v2_sql.go:411:59: loop variable 'stmtIdx' captured by reference, and may hold an undesirable value by the time the deferred function is called

The PR in question is:
https://reviewable.io/reviews/cockroachdb/cockroach/79663#-N0XWFS5CidRv8Q3C4t1

Line 411 is retErr = errors.Wrapf(retErr, "executing stmt %d", stmtIdx+1).

I think there's no goroutines at play there, just two nested anonymous functions. As far as I can see, at least.

@srosenberg
Copy link
Copy Markdown
Member

The PR in question is: https://reviewable.io/reviews/cockroachdb/cockroach/79663#-N0XWFS5CidRv8Q3C4t1

Line 411 is retErr = errors.Wrapf(retErr, "executing stmt %d", stmtIdx+1).

I think there's no goroutines at play there, just two nested anonymous functions. As far as I can see, at least.

It's captured inside the defer statement which is subject to the same rules as goroutines. If this is benign, the author can suppress it via,

//nolint:loopvarcapture

@andreimatei
Copy link
Copy Markdown
Contributor

But look at the code. The defer does not outlive the loop iteration; it's inside a function called inline.

@srosenberg
Copy link
Copy Markdown
Member

But look at the code. The defer does not outlive the loop iteration; it's inside a function called inline.

That's true; this defer is always executed sequentially. @renatolabs Thoughts on whether it's easy to allow these cases? i.e., exclude defer when it's declared inside an applied closure?

Personally, I would refactor to this form,

stmtErr := func(int i) (retErr error) {
   ...
   txnRes.Statement = i
   ...
   defer func() {
        ...
        retErr = errors.Wrapf(retErr, "executing stmt %d", i)
        ...
   }()
   ...
}(stmtIdx+1)

Not only does it improve readability by removing the common subexpression (stmtIdx+1), its (operational) semantics are much more lucid, imo. It's not immediately obvious that the deeply nested defer is executed on each iteration; maybe it's just me, but I had to stare at the code for a good 30+ seconds to prove that it's free of unwanted side-effects.

@renatolabs
Copy link
Copy Markdown
Author

renatolabs commented Jun 24, 2022

I'm aware that the linter's treatment of defer can be improved in cases where it's contained in a closure within the loop; this was on my list. Thanks for reporting, I'll take a look at it soon.

@andreimatei
Copy link
Copy Markdown
Contributor

Personally, I would refactor to this form,

This rewrite is worse, in my opinion. You're introducing an argument (i) for which you have to scroll down two screens to see what it's assigned to. That's part of the reason why I was against this linter - it encourages doing this, which can be harmful to readability.


@renatolabs given that this false-positive was triggered by, I believe, literally the first bors batch after merging the linter, can I convince you to revert it until you fix it (if the fix is not imminent)? The use of anonymous functions called inline for the explicit purpose of putting a defer in them is a useful and common Go idiom, and I think we definitely don't want to flag it.

@knz
Copy link
Copy Markdown
Contributor

knz commented Jun 24, 2022

The use of anonymous functions called inline for the explicit purpose of putting a defer in them is a useful and common Go idiom

Yep, that's true. For example, I just merged PR #83083 that added a dozen of those.

@srosenberg
Copy link
Copy Markdown
Member

The use of anonymous functions called inline for the explicit purpose of putting a defer in them is a useful and common Go idiom

Yep, that's true. For example, I just merged PR #83083 that added a dozen of those.

This is a good example from that PR,

// Close releases the resources used by the IndexBackfiller.
func (ib *IndexBackfiller) Close(ctx context.Context) {
	if ib.mon != nil {
		func() {
			ib.muBoundAccount.Lock()
			defer ib.muBoundAccount.Unlock()
			ib.muBoundAccount.boundAccount.Close(ctx)
		}()
		ib.mon.Stop(ctx)
	}
}

@renatolabs
Copy link
Copy Markdown
Author

There's a PR up to handle the situation discussed here: #83418.

@andreimatei
Copy link
Copy Markdown
Contributor

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.