Skip to content

Fix various data races in pkg/aws/eni and pkg/ipam#11685

Merged
aanm merged 7 commits intocilium:masterfrom
christarazi:pr/christarazi/fix-data-races-eni-tests
May 27, 2020
Merged

Fix various data races in pkg/aws/eni and pkg/ipam#11685
aanm merged 7 commits intocilium:masterfrom
christarazi:pr/christarazi/fix-data-races-eni-tests

Conversation

@christarazi
Copy link
Copy Markdown
Member

This PR also contains a potential fix for the flakiness seen in
#11560. The commit for the fix adds a
setup and cleanup method for a metrics mock implementation used in the
ENISuite.

See commit msgs.

Related #11560
Related #10677

@christarazi christarazi requested a review from a team as a code owner May 26, 2020 07:10
@christarazi christarazi requested a review from a team May 26, 2020 07:10
@christarazi christarazi added area/CI Continuous Integration testing issue or flake area/eni Impacts ENI based IPAM. kind/enhancement This would improve or streamline existing functionality. ci/flake This is a known failure that occurs in the tree. Please investigate me! release-note/misc This PR makes changes that have no direct user impact. labels May 26, 2020
@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@christarazi
Copy link
Copy Markdown
Member Author

christarazi commented May 26, 2020

test-me-please

Edit: K8s-1.17-Kernel-4.19 hit known flakes #11416 #11313

@coveralls
Copy link
Copy Markdown

coveralls commented May 26, 2020

Coverage Status

Coverage increased (+0.01%) to 36.941% when pulling 270b597 on christarazi:pr/christarazi/fix-data-races-eni-tests into e09d991 on cilium:master.

@christarazi
Copy link
Copy Markdown
Member Author

retest-4.9

@christarazi
Copy link
Copy Markdown
Member Author

retest-4.19

@christarazi christarazi marked this pull request as ready for review May 26, 2020 17:01
@christarazi
Copy link
Copy Markdown
Member Author

christarazi commented May 26, 2020

Travis has passed thrice so far...

Edit: 8 times...

@joestringer joestringer requested a review from tgraf May 27, 2020 01:50
Comment thread pkg/aws/eni/node.go Outdated
Comment on lines 293 to 313
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd seeing a private method getting a mutext. Please add a comment in the function stating n.mutex must NOT be held.

Comment thread pkg/aws/eni/instances.go Outdated
Comment on lines 207 to 209
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use a defer

	m.mutex.RLock()
        defer m.mutex.RUnlock()
	m.instances.ForeachInterface(instanceID, fn)

@aanm
Copy link
Copy Markdown
Member

aanm commented May 27, 2020

@christarazi my requested changes won't require the CI to re-run again. Ping me once they are addressed so we can merge the PR without running the CI.

This fixes data races found when running the unit-tests with `-race`.
The mutex is to protect access to `n.enis`.

```
WARNING: DATA RACE
Write at 0x00c0007b40e0 by goroutine 135:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).ResyncInterfacesAndIPs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:430 +0xfa
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:352 +0xfb
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:342 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:389 +0x88

Previous read at 0x00c0007b40e0 by goroutine 46:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).findNextIndex()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:292 +0x86
  github.com/cilium/cilium/pkg/aws/eni.(*Node).CreateInterface()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:339 +0x584
  github.com/cilium/cilium/pkg/ipam.(*Node).createInterface()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:435 +0x290
  github.com/cilium/cilium/pkg/ipam.(*Node).maintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:628 +0x85d
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:663 +0x82
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 135 (running) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:388 +0x2c5
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 46 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:593 +0x80a
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

```
WARNING: DATA RACE
Write at 0x00c000110060 by goroutine 94:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).ResyncInterfacesAndIPs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:439 +0xfa
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:352 +0xfb
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:342 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:389 +0x88

Previous read at 0x00c000110060 by goroutine 92:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).PrepareIPAllocation()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:211 +0xefc
  github.com/cilium/cilium/pkg/ipam.(*Node).determineMaintenanceAction()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:542 +0x184
  github.com/cilium/cilium/pkg/ipam.(*Node).maintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:578 +0x53
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:663 +0x82
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 94 (running) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:388 +0x2c5
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 92 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:593 +0x80a
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit mostly reworks `eni.(*Node).ResyncInterfacesAndIPs()` as it
makes many unprotected accesses to shared resources. This change is to
reduce the scope of the mutexes to only when a shared resource is
accessed. It also refactors that code such that only the immediate
structure is responsible for locking / unlocking its mutex. This allows
a mixtures of read and write locks intended to minimize the impact on
performance as much as possible. This change also attempts to minimize
the nesting of mutex locks as much as possible, which can lead to
deadlocking or long hanging periods.

Not all the data race output from Golang is provided in this commit body
as there were many small, nested data races that were fixed along the
way. The main, important data race outputs are provided below.

All data races were found by running `-race` when running `make
unit-tests`.

```
WARNING: DATA RACE
Read at 0x00c00017c388 by goroutine 86:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).ResyncInterfacesAndIPs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:433 +0x1ca
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:352 +0xfb
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:669 +0xc1
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Previous write at 0x00c00017c388 by goroutine 82:
  github.com/cilium/cilium/pkg/aws/eni.(*InstancesManager).Resync()
      /home/chris/code/cilium/cilium/pkg/aws/eni/instances.go:185 +0xf24
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:589 +0xb98
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 86 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:593 +0x80a
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

```
WARNING: DATA RACE
Write at 0x00c0001f3738 by goroutine 35:
  github.com/cilium/cilium/pkg/aws/eni.(*InstancesManager).Resync()
      /home/chris/code/cilium/cilium/pkg/aws/eni/instances.go:185 +0xf24
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:167 +0x8f
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Previous read at 0x00c0001f3738 by goroutine 39:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).PopulateStatusFields()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:84 +0x11e
  github.com/cilium/cilium/pkg/ipam.(*Node).syncToAPIServer()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:701 +0x1ce
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func2()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:256 +0x41
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 35 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.NewNodeManager()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:162 +0x34f
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestGetNodeNames()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:71 +0x42f
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 39 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:251 +0x938
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestGetNodeNames()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:91 +0x1411
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

```
WARNING: DATA RACE
Write at 0x00c0007981f8 by goroutine 81:
  github.com/cilium/cilium/pkg/aws/eni.(*InstancesManager).Resync()
      /home/chris/code/cilium/cilium/pkg/aws/eni/instances.go:185 +0xf24
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:167 +0x8f
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Previous read at 0x00c0007981f8 by goroutine 152:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).getSecurityGroupIDs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:248 +0x1f2
  github.com/cilium/cilium/pkg/aws/eni.(*Node).CreateInterface()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:332 +0x275
  github.com/cilium/cilium/pkg/ipam.(*Node).createInterface()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:435 +0x290
  github.com/cilium/cilium/pkg/ipam.(*Node).maintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:628 +0x85d
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:663 +0x82
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 81 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.NewNodeManager()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:162 +0x34f
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:578 +0x62d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 152 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:593 +0x80a
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This data race occurs in SetRunningLocked() which has a comment stating
that it assumes that `Node` has been locked. That is not the case, and
this PR fixes that incorrect assumption by renaming the function to
SetRunning() and locking the mutex when writing to the shared resource.

This fixes a data race found when running the unit-tests with `-race`.

```
WARNING: DATA RACE
Write at 0x00c0002d1030 by goroutine 79:
  github.com/cilium/cilium/pkg/ipam.(*Node).SetRunningLocked()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:133 +0xc7
  github.com/cilium/cilium/pkg/aws/eni.(*Node).errorInstanceNotRunning()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:274 +0xfe
  github.com/cilium/cilium/pkg/aws/eni.(*Node).CreateInterface()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:385 +0xe9f
  github.com/cilium/cilium/pkg/ipam.(*Node).createInterface()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:435 +0x290
  github.com/cilium/cilium/pkg/ipam.(*Node).maintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:628 +0x85d
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:663 +0x82
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Previous read at 0x00c0002d1030 by goroutine 77:
  github.com/cilium/cilium/pkg/ipam.(*Node).IsRunning()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:125 +0x85
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerInstanceNotRunning.func1()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:655 +0x6d
  github.com/cilium/cilium/pkg/testutils.WaitUntil()
      /home/chris/code/cilium/cilium/pkg/testutils/condition.go:36 +0x97
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerInstanceNotRunning()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:653 +0xf22
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 79 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerInstanceNotRunning()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:650 +0xed8
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 77 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:763 +0x1b9
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:818 +0x228
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x1c7
  gopkg.in/check%2ev1.Run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x5a
  gopkg.in/check%2ev1.RunAll()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x127
  gopkg.in/check%2ev1.TestingT()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x5f6
  github.com/cilium/cilium/pkg/aws/eni.Test()
      /home/chris/code/cilium/cilium/pkg/aws/eni/eni_test.go:26 +0x38
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:991 +0x1eb
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This fixes a data race found when running the unit-tests with `-race`.

```
WARNING: DATA RACE
Write at 0x00c000067168 by goroutine 118:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).UpdatedNode()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:69 +0x5f
  github.com/cilium/cilium/pkg/ipam.(*Node).UpdatedResource()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:319 +0x81
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:273 +0x14a
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerReleaseAddress()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:467 +0x101d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x00c000067168 by goroutine 324:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).getLimits()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:101 +0x5a
  github.com/cilium/cilium/pkg/aws/eni.(*Node).GetMaximumAllocatableIPv4()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:488 +0x2e2
  github.com/cilium/cilium/pkg/ipam.(*Node).getMaxAllocate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:213 +0x79
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:368 +0x42a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:342 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:389 +0x88

Goroutine 118 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:763 +0x1b9
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:818 +0x228
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x1c7
  gopkg.in/check%2ev1.Run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x5a
  gopkg.in/check%2ev1.RunAll()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x127
  gopkg.in/check%2ev1.TestingT()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x5f6
  github.com/cilium/cilium/pkg/aws/eni.Test()
      /home/chris/code/cilium/cilium/pkg/aws/eni/eni_test.go:26 +0x38
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:991 +0x1eb

Goroutine 324 (finished) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:388 +0x2c5
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This fixes a data race found when running the unit-tests with `-race`.

```
WARNING: DATA RACE
Write at 0x00c00288e358 by goroutine 256:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).ResyncInterfacesAndIPs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:448 +0x8e
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:352 +0xfb
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:347 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:394 +0x88

Previous read at 0x00c00288e358 by goroutine 416:
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerReleaseAddress()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:498 +0x1d71
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 256 (running) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:393 +0x2cf
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4d1

Goroutine 416 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:763 +0x1b9
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:818 +0x228
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x1c7
  gopkg.in/check%2ev1.Run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x5a
  gopkg.in/check%2ev1.RunAll()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x127
  gopkg.in/check%2ev1.TestingT()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x5f6
  github.com/cilium/cilium/pkg/aws/eni.Test()
      /home/chris/code/cilium/cilium/pkg/aws/eni/eni_test.go:26 +0x38
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:991 +0x1eb
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit creates a locked version of `eni.(*Node).getLimits()`
because `eni.(*Node).GetMaximumAllocatableIPv4()` locked the mutex at
the start of the function. This prevents a recursive lock and a
potential deadlock.

This fixes a data race found when running the unit-tests with `-race`.

```

RNING: DATA RACE
Write at 0x00c00075c180 by goroutine 33:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).UpdatedNode()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:71 +0x85
  github.com/cilium/cilium/pkg/ipam.(*Node).UpdatedResource()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:319 +0x81
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:278 +0x149
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerDefaultAllocation()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:271 +0xff1
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x00c00075c180 by goroutine 77:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).GetMaximumAllocatableIPv4()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:499 +0x7e8
  github.com/cilium/cilium/pkg/ipam.(*Node).getMaxAllocate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:213 +0x79
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:368 +0x42a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:347 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:394 +0x88

Goroutine 33 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:763 +0x1b9
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:818 +0x228
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x1c7
  gopkg.in/check%2ev1.Run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x5a
  gopkg.in/check%2ev1.RunAll()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x127
  gopkg.in/check%2ev1.TestingT()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x5f6
  github.com/cilium/cilium/pkg/aws/eni.Test()
      /home/chris/code/cilium/cilium/pkg/aws/eni/eni_test.go:26 +0x38
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:991 +0x1eb

Goroutine 77 (running) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:393 +0x2cf
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4d1
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
In this test suite, the `metricsapi` is global variable which is shared
among all the `Test*` functions. It is possible that it becomes polluted
over time during test execution.

This is an attempt to resolve the following:

```
FAIL: node_manager_test.go:563: ENISuite.TestNodeManagerManyNodes
node_manager_test.go:602:

    c.Errorf("Node %s allocation mismatch. expected: %d allocated: %d", s.name, minAllocate, node.Stats().AvailableIPs)

... Error: Node node53 allocation mismatch. expected: 10 allocated: 18

node_manager_test.go:602:

    c.Errorf("Node %s allocation mismatch. expected: %d allocated: %d", s.name, minAllocate, node.Stats().AvailableIPs)

... Error: Node node59 allocation mismatch. expected: 10 allocated: 18

node_manager_test.go:617:
    c.Assert(metricsapi.AllocatedIPs("available"), check.Equals, numNodes*minAllocate)
... obtained int = 1016
... expected int = 1000

...
OOPS: 17 passed, 1 FAILED
--- FAIL: Test (2.60s)
FAIL
coverage: 4.3% of statements in ./...

FAIL	github.com/cilium/cilium/pkg/aws/eni	2.684s
FAIL

Makefile:204: recipe for target 'unit-tests' failed
make: *** [unit-tests] Error 1
The command "./.travis/build.sh" exited with 2.
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-data-races-eni-tests branch from 9d72371 to 270b597 Compare May 27, 2020 16:33
@christarazi
Copy link
Copy Markdown
Member Author

@aanm Made the requested modifications.

@christarazi christarazi requested a review from aanm May 27, 2020 16:34
@aanm aanm merged commit 9168268 into cilium:master May 27, 2020
@christarazi christarazi deleted the pr/christarazi/fix-data-races-eni-tests branch May 27, 2020 17:09
sayboras added a commit to sayboras/cilium that referenced this pull request Jun 1, 2020
This commit is to fix the recent changes after rebased. Related PR
is cilium#11685.

Basically, there was a check if Node is nil after calling mutex lock,
which will cause nil pointer dereference. I just refactor the code
to make sure underlying node is not nil before obtaining mutex lock.

Relates to 5259ff7

Signed-off-by: Tam Mach <sayboras@yahoo.com>
aanm pushed a commit that referenced this pull request Jun 2, 2020
This commit is to fix the recent changes after rebased. Related PR
is #11685.

Basically, there was a check if Node is nil after calling mutex lock,
which will cause nil pointer dereference. I just refactor the code
to make sure underlying node is not nil before obtaining mutex lock.

Relates to 5259ff7

Signed-off-by: Tam Mach <sayboras@yahoo.com>
tklauser pushed a commit that referenced this pull request Jun 3, 2020
[ upstream commit e8a7de6 ]

This commit is to fix the recent changes after rebased. Related PR
is #11685.

Basically, there was a check if Node is nil after calling mutex lock,
which will cause nil pointer dereference. I just refactor the code
to make sure underlying node is not nil before obtaining mutex lock.

Relates to 5259ff7

Signed-off-by: Tam Mach <sayboras@yahoo.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
aanm pushed a commit that referenced this pull request Jun 4, 2020
[ upstream commit e8a7de6 ]

This commit is to fix the recent changes after rebased. Related PR
is #11685.

Basically, there was a check if Node is nil after calling mutex lock,
which will cause nil pointer dereference. I just refactor the code
to make sure underlying node is not nil before obtaining mutex lock.

Relates to 5259ff7

Signed-off-by: Tam Mach <sayboras@yahoo.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/CI Continuous Integration testing issue or flake area/eni Impacts ENI based IPAM. ci/flake This is a known failure that occurs in the tree. Please investigate me! kind/enhancement This would improve or streamline existing functionality. release-note/misc This PR makes changes that have no direct user impact.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants