
OCPBUGS-10924: Switch default SA to machine-config-operator #3740

Merged
openshift-merge-robot merged 1 commit into openshift:master from cdoern:serviceacct on Jun 28, 2023

Conversation

@cdoern

@cdoern cdoern commented Jun 8, 2023

Contributor

Create the new service account, tombstone the old one, and update all references.

All existing tests should pass unchanged, which shows that this change does not affect functionality.
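
The conversation does not show the manifests themselves. As a rough illustration of what "tombstoning" a release-payload object can look like, the sketch below prints a deletion manifest for a ClusterRoleBinding. It assumes the tombstone is expressed with the cluster-version operator's `release.openshift.io/delete: "true"` annotation; the function name `tombstone_crb` is hypothetical, and the real manifest in this PR may carry additional fields and annotations.

```shell
#!/bin/sh
# Print a deletion ("tombstone") manifest for the named ClusterRoleBinding.
# ASSUMPTION: the tombstone uses the CVO annotation release.openshift.io/delete.
# This is a release-payload sketch, not an object meant for `oc apply`
# (it deliberately omits roleRef and subjects).
tombstone_crb() {
    name=$1
    cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ${name}
  annotations:
    release.openshift.io/delete: "true"
EOF
}
```

Calling `tombstone_crb default-account-openshift-machine-config-operator` prints a manifest for the binding that the QA comments in this thread check for after the upgrade.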

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2023
@openshift-ci-robot

Contributor

@cdoern: This pull request references Jira Issue OCPBUGS-10924, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

create the new account, tombstone the old one, and update all references.

All tests should work the same as proof that this change does not impact functionality.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 8, 2023
@openshift-ci openshift-ci Bot requested review from cgwalters and sinnykumari June 8, 2023 18:13
@cdoern

cdoern commented Jun 8, 2023

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 8, 2023
@openshift-ci-robot

Contributor

@cdoern: This pull request references Jira Issue OCPBUGS-10924, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

Details

In response to this:

/jira refresh


@openshift-ci openshift-ci Bot requested a review from rioliu-rh June 8, 2023 18:16
@rioliu-rh


/cc @sergiordlr

@openshift-ci openshift-ci Bot requested a review from sergiordlr June 8, 2023 23:51
@rioliu-rh


/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 8, 2023
@cdoern cdoern force-pushed the serviceacct branch 2 times, most recently from 3a4f1f1 to 37a0aba Compare June 9, 2023 13:24
@cdoern

cdoern commented Jun 9, 2023

Contributor Author

/retest-required

@sergiordlr

sergiordlr commented Jun 9, 2023

Contributor
  1. Using IPI on AWS. The operator pod is not using the "default" service account
$  oc get pods -l k8s-app=machine-config-operator -o yaml |grep serviceAcc
    serviceAccount: machine-config-operator
    serviceAccountName: machine-config-operator
        - serviceAccountToken:
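
The `grep serviceAcc` check above also matches `serviceAccount:` and `serviceAccountToken:` lines, so it needs to be read by eye. A minimal sketch of a stricter check follows; the helper names `pod_service_accounts` and `assert_service_account` are hypothetical, and the functions only parse YAML text from stdin.

```shell
#!/bin/sh
# Print each distinct serviceAccountName value found in pod YAML on stdin.
# Only the exact key is matched, so "serviceAccount:" and
# "- serviceAccountToken:" lines are ignored.
pod_service_accounts() {
    awk '$1 == "serviceAccountName:" { print $2 }' | sort -u
}

# Succeed only if the pod YAML on stdin uses exactly the expected ServiceAccount.
assert_service_account() {
    expected=$1
    actual=$(pod_service_accounts)
    [ "$actual" = "$expected" ]
}
```

Against a live cluster this could be used as `oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-operator -o yaml | assert_service_account machine-config-operator`, which exits non-zero if the pod still runs as `default`.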

The following test cases were executed and passed:

"[sig-mco] MCO Author:sregidor-NonPreRelease-High-45239-KubeletConfig has a limit of 10 per cluster [Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonPreRelease-Medium-57009-Kubeletconfig custom tlsSecurityProfile [Disruptive] [Serial]"
"[sig-mco] MCO Author:mhanss-Longduration-NonPreRelease-Critical-42369-add container runtime config [Disruptive] [Serial]"
"[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-LongDuration-High-63894-Scaleup using 4.1 cloud image[Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonHyperShiftHOST-NonPreRelease-Critical-62084-Certificate rotation in paused pools[Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonPreRelease-Medium-52373-Modify proxy configuration in paused pools [Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonHyperShiftHOST-NonPreRelease-Critical-62084-Certificate rotation in paused pools[Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-Longduration-NonPreRelease-High-47009-Config Drift. New Service Unit. [Serial]"

  2. Upgrade using IPI on IBM:

Before upgrade:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-06-08-094044   True        False         51m     Cluster version is 4.13.0-0.nightly-2023-06-08-094044

$ oc get clusterrolebinding default-account-openshift-machine-config-operator
NAME                                                ROLE                        AGE
default-account-openshift-machine-config-operator   ClusterRole/cluster-admin   76m

$ oc get pods -l k8s-app=machine-config-operator -o yaml |grep serviceAcc
    serviceAccount: default
    serviceAccountName: default
        - serviceAccountToken:

After upgrade:

Both clusterrolebindings exist after the upgrade; the old one was not removed:

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.ci.test-2023-06-09-134650-ci-ln-hyqnmx2-latest   True        False         49s     Cluster version is 4.14.0-0.ci.test-2023-06-09-134650-ci-ln-hyqnmx2-latest

$ oc get clusterversion -o yaml
...
    history:
    - acceptedRisks: |-
        Target release version="" image="registry.build05.ci.openshift.org/ci-ln-hyqnmx2/release:latest" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified
        Forced through blocking failures: Multiple precondition checks failed:
        * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
        * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.13.0-0.nightly-2023-06-08-094044 to 4.14.0-0.ci.test-2023-06-09-134650-ci-ln-hyqnmx2-latest is unknown.
      completionTime: "2023-06-09T15:07:42Z"
      image: registry.build05.ci.openshift.org/ci-ln-hyqnmx2/release:latest
      startedTime: "2023-06-09T13:59:57Z"
      state: Completed
      verified: false
      version: 4.14.0-0.ci.test-2023-06-09-134650-ci-ln-hyqnmx2-latest
    - completionTime: "2023-06-09T12:51:18Z"
      image: registry.ci.openshift.org/ocp/release@sha256:6741ff6405e341b80c66d246e2514b21df8983015895289fbb4937ff77c8bdc0
      startedTime: "2023-06-09T12:25:38Z"
      state: Completed
      verified: false
      version: 4.13.0-0.nightly-2023-06-08-094044
    observedGeneration: 3
    versionHash: 0l0ye0VOXOI=





$  oc get clusterrolebinding custom-account-openshift-machine-config-operator default-account-openshift-machine-config-operator
NAME                                                ROLE                        AGE
custom-account-openshift-machine-config-operator    ClusterRole/cluster-admin   25m
default-account-openshift-machine-config-operator   ClusterRole/cluster-admin   161m

I don't know whether it is related to the problem, but the CVO pod logs show this:

$ oc logs  -n openshift-cluster-version cluster-version-operator-6f74c896fb-jnbh6 | grep default-account-openshift-machine-config-operator
I0609 15:00:10.114945       1 payload.go:210] excluding 0000_80_machine-config-operator_03_rbac.yaml group=rbac.authorization.k8s.io kind=ClusterRoleBinding namespace= name=default-account-openshift-machine-config-operator: unrecognized value include.release.openshift.io/self-managed-high-availability=false
I0609 15:00:12.385788       1 payload.go:210] excluding 0000_80_machine-config-operator_03_rbac.yaml group=rbac.authorization.k8s.io kind=ClusterRoleBinding namespace= name=default-account-openshift-machine-config-operator: unrecognized value include.release.openshift.io/self-managed-high-availability=false

I will attach a must-gather in a comment on the Jira issue.

The default-account-openshift-machine-config-operator clusterrolebinding was not removed, so we can't add the qe-approved label yet.
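
To review which manifests the CVO reports as excluded, the two log lines above can be condensed into one line per object. This is a sketch that assumes the log format matches those lines exactly (it is not a documented, stable format), and the function name `cvo_excluded_manifests` is hypothetical. It does not establish whether the exclusion caused the leftover binding.

```shell
#!/bin/sh
# Read CVO log lines on stdin and print "<kind>/<name>: <reason>" once for
# every manifest the CVO reports as excluded.
# ASSUMPTION: lines look like the "payload.go ... excluding ..." lines quoted above.
cvo_excluded_manifests() {
    sed -n 's|.*] excluding [^ ]* group=[^ ]* kind=\([^ ]*\) namespace=[^ ]* name=\([^ ]*\): \(.*\)$|\1/\2: \3|p' | sort -u
}
```

For example, `oc logs -n openshift-cluster-version <cvo-pod> | cvo_excluded_manifests`, where `<cvo-pod>` stands for the cluster-version-operator pod name.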

create the new account, tombstone the old one, and update all references.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@sergiordlr

Contributor
  1. Using IPI on AWS. The operator pod is not using the "default" service account
$ oc get pods -l k8s-app=machine-config-operator -o yaml -n openshift-machine-config-operator |grep serviceAcc
    serviceAccount: machine-config-operator
    serviceAccountName: machine-config-operator
        - serviceAccountToken:

Only custom-account-openshift-machine-config-operator clusterrolebinding is created (we don't create default-account-openshift-machine-config-operator)

$ oc get clusterrolebinding |grep machine-config-operator
custom-account-openshift-machine-config-operator                            ClusterRole/cluster-admin                                                               38m

The following test cases were executed and passed:

"[sig-mco] MCO Author:sregidor-NonPreRelease-High-45239-KubeletConfig has a limit of 10 per cluster [Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonPreRelease-Medium-57009-Kubeletconfig custom tlsSecurityProfile [Disruptive] [Serial]"
"[sig-mco] MCO Author:mhanss-Longduration-NonPreRelease-Critical-42369-add container runtime config [Disruptive] [Serial]"
"[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-LongDuration-High-63894-Scaleup using 4.1 cloud image[Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonHyperShiftHOST-NonPreRelease-Critical-62084-Certificate rotation in paused pools[Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonPreRelease-Medium-52373-Modify proxy configuration in paused pools [Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-NonHyperShiftHOST-NonPreRelease-Critical-62084-Certificate rotation in paused pools[Disruptive] [Serial]"
"[sig-mco] MCO Author:sregidor-Longduration-NonPreRelease-High-47009-Config Drift. New Service Unit. [Serial]"
"[sig-mco] MCO Layering Author:sregidor-ConnectedOnly-VMonly-Longduration-NonPreRelease-Critical-54085-Update osImage changing /etc /usr and rpm [Disruptive] [Serial]"

  2. Upgrade using IPI on IBM:

Before the upgrade, the machine-config-operator pod uses the default SA:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-06-20-224158   True        False         15m     Cluster version is 4.13.0-0.nightly-2023-06-20-224158

$ oc get pods -l k8s-app=machine-config-operator -o yaml -n openshift-machine-config-operator |grep serviceAcc
    serviceAccount: default
    serviceAccountName: default
        - serviceAccountToken:

The upgrade completes without problems:

$ oc get clusterversion -o yaml
....
    history:
    - acceptedRisks: |-
        Target release version="" image="registry.build01.ci.openshift.org/ci-ln-4hs2vmb/release:latest" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified
        Forced through blocking failures: Multiple precondition checks failed:
        * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
        * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.13.0-0.nightly-2023-06-20-224158 to 4.14.0-0.ci.test-2023-06-26-074430-ci-ln-4hs2vmb-latest is unknown.
      completionTime: "2023-06-26T09:52:36Z"
      image: registry.build01.ci.openshift.org/ci-ln-4hs2vmb/release:latest
      startedTime: "2023-06-26T08:37:24Z"
      state: Completed
      verified: false
      version: 4.14.0-0.ci.test-2023-06-26-074430-ci-ln-4hs2vmb-latest
    - completionTime: "2023-06-26T08:21:44Z"
      image: registry.ci.openshift.org/ocp/release@sha256:127ebebfe1b39b05186781d1a926cab8f06d0599febc30697aa7a34c1e7ccc36
      startedTime: "2023-06-26T07:58:57Z"
      state: Completed
      verified: false
      version: 4.13.0-0.nightly-2023-06-20-224158

After upgrade:

The clusterrolebinding for the default account is removed, and a new one is created for the custom account. The operator pod is not using the default SA:

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.ci.test-2023-06-26-074430-ci-ln-4hs2vmb-latest   True        False         14m     Cluster version is 4.14.0-0.ci.test-2023-06-26-074430-ci-ln-4hs2vmb-latest

$ oc get clusterrolebinding custom-account-openshift-machine-config-operator default-account-openshift-machine-config-operator
NAME                                               ROLE                        AGE
custom-account-openshift-machine-config-operator   ClusterRole/cluster-admin   44m
Error from server (NotFound): clusterrolebindings.rbac.authorization.k8s.io "default-account-openshift-machine-config-operator" not found

$ oc get pods -l k8s-app=machine-config-operator -o yaml -n openshift-machine-config-operator |grep serviceAcc
    serviceAccount: machine-config-operator
    serviceAccountName: machine-config-operator
        - serviceAccountToken:
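
The clusterrolebinding check above can also be expressed as a single pass/fail test. The sketch below reads `oc get clusterrolebinding` output from stdin and succeeds only when the custom-account binding is present and the default-account binding is gone; the function name `crb_migration_complete` is hypothetical.

```shell
#!/bin/sh
# Read `oc get clusterrolebinding` output on stdin. Exit 0 only if the
# custom-account binding is listed and the default-account binding is not.
crb_migration_complete() {
    awk '
        $1 == "custom-account-openshift-machine-config-operator"  { custom = 1 }
        $1 == "default-account-openshift-machine-config-operator" { stale = 1 }
        END { exit !(custom && !stale) }
    '
}
```

Piping the full listing (`oc get clusterrolebinding | crb_migration_complete`) avoids the NotFound error that appears when the removed binding is requested by name.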

After the upgrade we executed the "[sig-mco] MCO Author:sregidor-Longduration-NonPreRelease-High-47045-Config Drift. Compressed files. [Serial]" test case, and it passed.

We can add the qe-approved label.

/label qe-approved

@openshift-ci openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label Jun 26, 2023
@cdoern

cdoern commented Jun 26, 2023

Contributor Author

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2023
@cdoern

cdoern commented Jun 26, 2023

Contributor Author

/retest-required

@openshift-ci

openshift-ci Bot commented Jun 26, 2023

Contributor

@cdoern: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit    Details   Required   Rerun command
ci/prow/okd-scos-e2e-aws-ovn   ace637f   link      false      /test okd-scos-e2e-aws-ovn


Details


@cdoern

cdoern commented Jun 28, 2023

Contributor Author

/retest-required

@djoshy

djoshy commented Jun 28, 2023

Contributor

/lgtm Thanks for the fix - let's pray to the hypershift gods 🙏

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2023
@djoshy

djoshy commented Jun 28, 2023

Contributor

Strange, it applied approved but not lgtm 🤔

@sinnykumari

Contributor

/lgtm

@sinnykumari

Contributor

Perhaps some issue with the bot.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2023
@openshift-ci

openshift-ci Bot commented Jun 28, 2023

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cdoern, djoshy, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cdoern,djoshy,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 63eb794 into openshift:master Jun 28, 2023
@openshift-ci-robot

Contributor

@cdoern: Jira Issue OCPBUGS-10924: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-10924 has been moved to the MODIFIED state.

Details

In response to this:

create the new account, tombstone the old one, and update all references.

All tests should work the same as proof that this change does not impact functionality.


wking added a commit to wking/machine-config-operator that referenced this pull request Oct 16, 2023
ace637f (OCPBUGS-10924: Switch default SA to
machine-config-operator, 2023-06-23, openshift#3740) moved the 4.14
machine-config operator to a non-default ServiceAccount and
ClusterRoleBinding.  But 4.13 and earlier remain on the default
ServiceAccount.

1cdb75f (install: Recreate and delayed default ServiceAccount
deletion, 2023-09-19, openshift#3923, OCPBUGS-19400) brought Recreate logic
back to 4.13.14 [1] and later (good), but also brought back a 'delete'
manifest for the default ClusterRoleBinding, which leads to the 4.13
cluster-version operator fighting with itself over whether that
ClusterRoleBinding should exist (it should exist on 4.13).  For
example, [2] updates from 4.12.36 to 4.13.14, and has:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-09-28T17:09:41Z Upgradeable=False ResourceDeletesInProgress: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=clusterrolebinding "default-account-openshift-machine-config-operator"

By dropping the deletion manifest from 4.13, we avoid contention
between two manifests, and leave the default ClusterRoleBinding alone
until a later update to 4.14 will remove it.

[1]:  https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.14
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904
wking added a commit to wking/machine-config-operator that referenced this pull request Oct 16, 2023
ace637f (OCPBUGS-10924: Switch default SA to
machine-config-operator, 2023-06-23, openshift#3740) moved the 4.14
machine-config operator to a non-default ServiceAccount and
ClusterRoleBinding.  But 4.13 and earlier remain on the default
ServiceAccount.

1cdb75f (install: Recreate and delayed default ServiceAccount
deletion, 2023-09-19, openshift#3923, OCPBUGS-19400) brought Recreate logic
back to 4.13.14 [1] and later (good), but also brought back a 'delete'
manifest for the default ClusterRoleBinding, which leads to the 4.13
cluster-version operator fighting with itself over whether that
ClusterRoleBinding should exist (it should exist on 4.13) [2].  For
example, [3] updates from 4.12.36 to 4.13.14, and has:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-09-28T17:09:41Z Upgradeable=False ResourceDeletesInProgress: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=clusterrolebinding "default-account-openshift-machine-config-operator"

By dropping the deletion manifest from 4.13, we avoid contention
between two manifests, and leave the default ClusterRoleBinding alone
until a later update to 4.14 will remove it.

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.14
[2]: https://issues.redhat.com/browse/OCPBUGS-10924
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904
wking added a commit to wking/machine-config-operator that referenced this pull request Oct 16, 2023
ace637f (OCPBUGS-10924: Switch default SA to
machine-config-operator, 2023-06-23, openshift#3740) moved the 4.14
machine-config operator to a non-default ServiceAccount and
ClusterRoleBinding.  But 4.13 and earlier remain on the default
ServiceAccount.

1cdb75f (install: Recreate and delayed default ServiceAccount
deletion, 2023-09-19, openshift#3923, OCPBUGS-19400) brought Recreate logic
back to 4.13.14 [1] and later (good), but also brought back a 'delete'
manifest for the default ClusterRoleBinding, which leads to the 4.13
cluster-version operator fighting with itself over whether that
ClusterRoleBinding should exist (it should exist on 4.13) [2].  For
example, [3] updates from 4.12.36 to 4.13.14, and has:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-09-28T17:09:41Z Upgradeable=False ResourceDeletesInProgress: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=clusterrolebinding "default-account-openshift-machine-config-operator"

By dropping the deletion manifest from 4.13, we avoid contention
between two manifests, and leave the default ClusterRoleBinding alone
until a later update to 4.14 will remove it.

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.14
[2]: https://issues.redhat.com/browse/OCPBUGS-21721
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904
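
The `jq` filter in the commit message above can be wrapped in a small function so the same check runs against either a CI artifact or a live cluster. This is a sketch that reuses the filter verbatim; the function name `upgradeable_conditions` is hypothetical, and it assumes `jq` is installed and that the input is a ClusterVersion list in JSON form.

```shell
#!/bin/sh
# Read a ClusterVersion list (JSON) on stdin and print one line per
# Upgradeable condition: "<time> Upgradeable=<status> <reason>: <message>".
# The jq filter is the one quoted in the commit message.
upgradeable_conditions() {
    jq -r '.items[].status.conditions[]
           | select(.type == "Upgradeable")
           | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
}
```

Usage might look like `oc get clusterversion -o json | upgradeable_conditions`, or `curl -s <clusterversion.json URL> | upgradeable_conditions` for a CI artifact. A `ResourceDeletesInProgress` reason naming the default ClusterRoleBinding would match the contention described in the commit message.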
ptalgulk01 pushed a commit to ptalgulk01/machine-config-operator that referenced this pull request May 15, 2026
OCPBUGS-10924: Switch default SA to machine-config-operator

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR.


7 participants