Skip to content

identity: Fix nil pointer panic in LookupIdentityByID#13514

Merged
aanm merged 2 commits intomasterfrom
pr/gandro/fix-identiy-allocator-nil-pointer
Oct 15, 2020
Merged

identity: Fix nil pointer panic in LookupIdentityByID#13514
aanm merged 2 commits intomasterfrom
pr/gandro/fix-identiy-allocator-nil-pointer

Conversation

@gandro
Copy link
Copy Markdown
Member

@gandro gandro commented Oct 12, 2020

Because the identity allocator is initialized asynchronously via
InitIdentityAllocator, its internal local identitiy allocator might
not have been initialized yet when the lookup functions are called.
This can cause nil pointer panics, as observed in #13479.

Before b194612, this nil pointer panic could not occur in
LookupIdentityByID as the function checked for m.IdentityAllocator != nil
which also implies m.localIdentities != nil.
This commit adds an explict check for m.localIdentities.

Fixes: #13479
Fixes: b194612 ("identity: Avoid kvstore lookup for local identities")

Signed-off-by: Sebastian Wicki sebastian@isovalent.com


Note: While the backported version of b194612 in v1.7 and v1.8 does not suffer from the nil pointer panic due to an early check fo m.IdentityAllocator != nil , I think we should still back-port this to fix the potential nil-pointer panic in LookupIdentity which is also fixed by this commit.

@gandro gandro added priority/release-blocker release-note/bug This PR fixes an issue in a previous release of Cilium. labels Oct 12, 2020
@gandro gandro requested a review from a team October 12, 2020 16:20
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 12, 2020

test-me-please

Copy link
Copy Markdown
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro why don't we need to wait before the variable is assigned with the <-m.localIdentityAllocatorInitialized?

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 12, 2020

@gandro why don't we need to wait before the variable is assigned with the <-m.localIdentityAllocatorInitialized?

Fair point. My thinking was that I would restore the behavior present in 1.7/1.8. But assuming there are no issues with deadlocks, I guess we could rely on m.localIdentityAllocatorInitialized instead. I'll update the PR.

Edit: Overall, I don't think waiting on m.localIdentityAllocatorInitialized will change anything however. Once it becomes ready, the local identity cache will be empty, so it will also return simply nil.

@aanm
Copy link
Copy Markdown
Member

aanm commented Oct 12, 2020

@gandro why don't we need to wait before the variable is assigned with the <-m.localIdentityAllocatorInitialized?

Fair point. My thinking was that I would restore the behavior present in 1.7/1.8. But assuming there are no issues with deadlocks, I guess we could rely on m.localIdentityAllocatorInitialized instead. I'll update the PR.

Edit: Overall, I don't think waiting on m.localIdentityAllocatorInitialized will change anything however. Once it becomes ready, the local identity cache will be empty, so it will also return simply nil.

@gandro the race detector will complain that we are accessing the variable concurrently 😄

@gandro gandro force-pushed the pr/gandro/fix-identiy-allocator-nil-pointer branch from 3cb7f88 to 4566e8f Compare October 13, 2020 14:33
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 13, 2020

@aanm PTAL again. After looking a bit more closely at initialization, I decided to do a non-blocking check on the channel. This ensures that we have a proper synchronization point to avoid data races. In the process, I found some additional potential data races and nil pointer panics that I fixed.

I decided to fail the lookup instead of waiting for the caches to be initialized. If we wait for full initialization including synchronization, then both the DNS Proxy (call site) and Hubble (call site) would not be operational until we have synchronized the identity cache. I don't think we want that.

If callers need to perform a lookup on a synchronized cache, they should call WaitForInitialGlobalIdentities first.

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 13, 2020

test-me-please

@gandro gandro requested a review from aanm October 13, 2020 15:28
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 14, 2020

GKE hit a flake "could not get service LoadBalancer IP addr: 30s timeout expired" https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/2802/

This does not seem related

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 14, 2020

test-gke

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 14, 2020

Hit #13554 twice

Because the identity allocator is initialized asychronously via
`InitIdentityAllocator`, the local identitiy allocator might not have
been initialized yet when the lookup functions are called. This can
cause nil pointer panics, as observed in #13479.

Before b194612, this nil pointer panic
could not occur in `LookupIdentityByID` as the function checked for
`m.IdentityAllocator != nil` which also implies `m.localIdentities != nil`.

This commit adds an explict check for `m.localIdentities` and fixes a
potential data race by checking the initialization channels before
accessing `m.localIdentities` or `m.IdentityAllocator`.

Fixes: #13479
Fixes: b194612 ("identity: Avoid kvstore lookup for local identities")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The local and global allocator are initialized asynchronously in
`InitIdentityAllocator`. Therefore we must check for the initialization
channels a potential synchronization point before we can check for
`m.{localIdentities,IdentityAllocator} != nil`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/fix-identiy-allocator-nil-pointer branch from 4566e8f to 87104cc Compare October 15, 2020 08:32
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 15, 2020

Rebased to pick up the fix the the GKE flake this observed. Rerunning CI

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 15, 2020

test-me-please

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 15, 2020

retest-net-next

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 15, 2020

@gandro
Copy link
Copy Markdown
Member Author

gandro commented Oct 15, 2020

retest-net-next

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Segfault in master during lookupByID()

5 participants