Skip to content

Fix GC uid races and handling of conflicting ownerReferences#92743

Merged
k8s-ci-robot merged 13 commits into
kubernetes:masterfrom
liggitt:gc
Nov 17, 2020
Merged

Fix GC uid races and handling of conflicting ownerReferences#92743
k8s-ci-robot merged 13 commits into
kubernetes:masterfrom
liggitt:gc

Conversation

@liggitt

@liggitt liggitt commented Jul 2, 2020

Copy link
Copy Markdown
Member

Addresses issues with race conditions encountered when observing invalid ownerReferences

Fixes #65200

  • address review comments
  • add comments to each commit on the scenario being addresses
  • add unit tests for GC controller and graph builder fixes
  • ensure the coverage of the following testcases are represented:

  • child in namespace A with owner reference to namespaced type in namespace B

    • should be deleted immediately
    • event should be logged in namespace A with involvedObject of bad-child indicating the error
    • test both orders:
      • real parent added to graph, then bad child added to graph
      • bad child added to graph, virtual parent added to graph, then real parent added to graph
  • child that is cluster-scoped with owner reference to namespaced type in namespace B

    • should not be deleted
    • event should be logged in namespace kube-system with involvedObject of bad-child indicating the error
    • test both orders:
      • real parent added to graph, then bad child added to graph
      • bad child added to graph, virtual parent added to graph, then real parent added to graph
  • child pointing at non-preferred still-served apiVersion of parent object (e.g. rbac/v1beta1)

    • should not be deleted prematurely
    • should not repeatedly poll attemptToDelete while waiting
    • should be deleted when the actual parent is deleted
  • child pointing at no-longer-served apiVersion of still-existing parent object (e.g. extensions/v1beta1 deployment)

    • should not be deleted (this is indistinguishable from referencing an unknown kind/version)
    • should not repeatedly poll attemptToDelete once real parent is observed
    • TODO: follow up: should be fixed up by a controller adjusting known ownerReference API versions
  • child pointing at no-longer-served apiVersion of no-longer-existing parent object (e.g. extensions/v1beta1 deployment)

    • should not be deleted (this is indistinguishable from referencing an unknown kind/version)
    • should repeatedly poll attemptToDelete
    • should not block deletion of legitimate children of missing deployment
    • TODO: follow up: should be fixed up by a controller adjusting known ownerReference API versions and once fixed up, should be deleted since parent does not exist
  • child pointing at incorrect apiVersion/kind of still-existing parent object (e.g. core/v1 Secret with uid=123, where an apps/v1 Deployment with uid=123 exists)

    • should be deleted immediately
    • should not trigger deletion of legitimate children of parent

Follow-ups:

/kind bug
/cc @jpbetz @deads2k
/sig api-machinery

Resolves non-deterministic behavior of the garbage collection controller when ownerReferences with incorrect data are encountered. Events with a reason of `OwnerRefInvalidNamespace` are recorded when namespace mismatches between child and owner objects are detected. The [kubectl-check-ownerreferences](https://github.com/kubernetes-sigs/kubectl-check-ownerreferences) tool can be run prior to upgrading to locate existing objects with invalid ownerReferences.
* A namespaced object with an ownerReference referencing a uid of a namespaced kind which does not exist in the same namespace is now consistently treated as though that owner does not exist, and the child object is deleted.
* A cluster-scoped object with an ownerReference referencing a uid of a namespaced kind is now consistently treated as though that owner is not resolvable, and the child object is ignored by the garbage collector.

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz July 2, 2020 06:52
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jul 2, 2020
@liggitt liggitt force-pushed the gc branch 3 times, most recently from 3235bec to a4b11dd Compare July 2, 2020 08:05
@liggitt

liggitt commented Jul 2, 2020

Copy link
Copy Markdown
Member Author

/retest

@liggitt

liggitt commented Jul 2, 2020

Copy link
Copy Markdown
Member Author

/skip

@fedebongio

Copy link
Copy Markdown
Contributor

/assign @jpbetz
/cc @caesarxuchao

@k8s-ci-robot k8s-ci-robot requested a review from caesarxuchao July 7, 2020 20:08
@liggitt liggitt force-pushed the gc branch 3 times, most recently from 28a133e to df55393 Compare July 9, 2020 07:16
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2020
@liggitt

liggitt commented Jul 9, 2020

Copy link
Copy Markdown
Member Author

/retest

@mattfarina

Copy link
Copy Markdown
Contributor

/retest pull-kubernetes-conformance-kind-ga-only-parallel

@k8s-ci-robot

This comment has been minimized.

@mattfarina

Copy link
Copy Markdown
Contributor

/test pull-kubernetes-conformance-kind-ga-only-parallel

@liggitt

liggitt commented Nov 17, 2020

Copy link
Copy Markdown
Member Author

docs update for 1.20 at kubernetes/website#25091

@liggitt liggitt deleted the gc branch November 17, 2020 22:41
@krmayankk

Copy link
Copy Markdown

curious why did we make the cross namespace reference an invalid case ?

@liggitt

liggitt commented Nov 18, 2020

Copy link
Copy Markdown
Member Author

why did we make the cross namespace reference an invalid case ?

namespaces are intended to be independent of each other, so cross-namespace references have not been permitted in things like ownerReferences, secret/configmap volume references, etc.

additionally, granting permissions to namespace a is not generally intended to provide visibility or ability to interact with objects from namespace b (or cause system controllers to interact with objects from namespace b).

@krmayankk

Copy link
Copy Markdown

why did we make the cross namespace reference an invalid case ?

namespaces are intended to be independent of each other, so cross-namespace references have not been permitted in things like ownerReferences, secret/configmap volume references, etc.

additionally, granting permissions to namespace a is not generally intended to provide visibility or ability to interact with objects from namespace b (or cause system controllers to interact with objects from namespace b).

thanks

@little-guy-lxr

Copy link
Copy Markdown

could this fix merge to 1.19.x ?

@liggitt

liggitt commented Sep 8, 2021

Copy link
Copy Markdown
Member Author

could this fix merge to 1.19.x ?

No, this is a significant enough rewrite of the garbagecontroller that it is limited to 1.20+

For clusters on previous versions, I would recommend running https://github.com/kubernetes-sigs/kubectl-check-ownerreferences tool is available to identify invalid ownerReferences that could trigger the bug this PR is fixing so they can be eliminated.

@aojea aojea mentioned this pull request Mar 15, 2023
harrycopa pushed a commit to harrycopa/Trident that referenced this pull request Mar 24, 2026
This PR fixes #474
Also refer: kubernetes/kubernetes#92743 (a child that is cluster-scoped with owner reference to namespaced type in namespace B), i.e. starting K8s 1.20, Kubernetes will log an event in namespace kube-system with involvedObject of bad-child indicating the error.

Today, TridentProvisioner CR (namespaced) gets set as Parent to cluster-scoped dependent objects such as clusterrole, clusterrolebinding, podsecuritypolicy, CSIDriver CR, and OCP's SCC, this confuses OCP 4.5 and above and thus OCP removes the ownerRef from the cluster scoped resources.

As part of the fix:
    1. ownerReference field is not going to be set on the cluster-scoped objects that TridentProvisioner CR creates.
    2. With this change there is no change in how TridentProvisioner CR interacts with Trident objects and autoheals them.
    3. There is also no change in the Trident uninstallation as well because Operator does not look at the ownerReferences when uninstalling Trident.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Garbage collector behavior on invalid ownerReferences for existing uids across namespaces and across kinds is non-deterministic