Skip to content

fix: handle ALREADY_EXISTS in TagKey and TagValue controllers#6943

Merged
cheftako merged 8 commits into
GoogleCloudPlatform:masterfrom
codebot-robot:fix/tags-acquire-existing-on-already-exists
Mar 18, 2026
Merged

fix: handle ALREADY_EXISTS in TagKey and TagValue controllers#6943
cheftako merged 8 commits into
GoogleCloudPlatform:masterfrom
codebot-robot:fix/tags-acquire-existing-on-already-exists

Conversation

@codebot-robot

Copy link
Copy Markdown
Collaborator

This PR handles the ALREADY_EXISTS error in TagKey and TagValue controllers by acquiring the existing resource.

It incorporates changes from #6890 by @afarbos, updated with:

  • Latest master rebase.
  • MockGCP normalization for TagKey/TagValue.
  • E2E tests for acquisition scenarios.

Fixes #6942
Supersedes #6890

…le values

Narrowed .etag, .createTime, and .updateTime normalization to only affect
cloudresourcemanager v3 APIs (Tags, operations) and skip already stable dummy
values like 'abcdef0123A='. This fixes regressions in unrelated tests
(e.g. Cloud Run) that were broken by global replacement of common dummy etags.
@codebot-robot

Copy link
Copy Markdown
Collaborator Author

--- INVESTIGATION REPORT ---
Run ID: 22803199004
Name: ci-presubmit
Cause: Code Error
Details: Although global log corruption was fixed by moving to path-based normalization in previous commits, the 'tests-e2e-fixtures-tags' suite continued to fail. This was due to two issues: (1) path-based replacements in 'tests/e2e/normalize.go' do not support nested slices (e.g., '.tagKeys[].etag'), leaving these fields unnormalized in List responses; and (2) new test fixtures added in this PR used incorrect '${etag}' placeholders in their golden files.
Action Taken:

  1. Moved 'Tags' specific normalization back to 'mockgcp/mockresourcemanager/normalize.go' using 'ConfigureVisitor' and 'ReplacePath', which correctly handles nested slices and is properly scoped to CRM.
  2. Updated 'TagKey' and 'TagValue' acquisition test golden files to use standard 'abcdef0123A=' and '2024-04-01T12:34:56.123456Z' values instead of placeholders.
  3. Removed broad '.createTime' and '.updateTime' replacements from 'tests/e2e/normalize.go' to avoid side effects in other services.

@codebot-robot

Copy link
Copy Markdown
Collaborator Author

--- INVESTIGATION REPORT ---
The workflow failures identified in the logs for PR #6890 have been analyzed and found to be addressed in this PR (#6943).

1. Tags (TagKey/TagValue) Acquisition Failures (Run ID: 22803199004)

  • Cause: Missing normalization for createTime, updateTime, and etag in the newly added tests, and the use of ${createTime} placeholders which were not handled by mockgcp.
  • Fix: This PR includes scoped normalization rules in mockresourcemanager/normalize.go and updated golden files with stable dummy values. I have verified locally that tagkeyacquire and tagvalueacquire now pass.

2. Alloydb Failure (Run ID: 22783928237)

  • Cause: Global normalization rules added for Tags in previous iterations were corrupting traffic logs of unrelated tests (like Alloydb) by replacing stable timestamps with ${createTime}.
  • Fix: This PR narrows the normalization scope to only affect the cloudresourcemanager service. I have verified locally that the Alloydb tests now pass.

3. Smoketest Failure (Run ID: 22783928237)

  • Cause: Timeout during StorageBucket reconciliation, which is a known infrastructure/test flake in the CI environment and unrelated to the code changes.

Current Status:
All identified failures are resolved in the current state of this PR. Tests for Tags acquisition and Alloydb are passing successfully.

@cheftako cheftako left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things to look at

}

apiError := &apierror.APIError{}
return errors.As(err, &apiError) && apiError.HTTPCode() == 409 ||

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operator-precedence trap in IsAlreadyExists —

return errors.As(err, &apiError) && apiError.HTTPCode() == 409 ||
grpcStatus.Code(err) == grpcCode.AlreadyExists

Due to && binding tighter than ||, the logic is actually:
return (errors.As(err, &apiError) && apiError.HTTPCode() == 409) ||
grpcStatus.Code(err) == grpcCode.AlreadyExists

That happens to be the intended semantics, but it's a readability trap. More
importantly, the sibling HasHTTPCode uses a guarded if errors.As(...) block,
which is the established pattern in this file. The new function is
inconsistent. Add parentheses and align the pattern with HasHTTPCode for
clarity.

Also: HasHTTPCode has a secondary fallback for HTTPCode() == -1, which
IsAlreadyExists omits without explanation.

name: "wrapped gRPC AlreadyExists",
err: fmt.Errorf("creating resource: %w", status.Error(codes.AlreadyExists, "already exists")),
want: true,
},

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No test for HTTP 409 REST path —

TestIsAlreadyExists covers gRPC paths but not the REST/HTTP 409 path directly.
Most GCP REST-transport clients return an *apierror.APIError with HTTPCode()
== 409 and no gRPC code — the primary production path for many resources. This
should be covered.

func (a *TagsTagKeyAdapter) acquireExistingTagKey(ctx context.Context, createOp *directbase.CreateOperation) error {
log := klog.FromContext(ctx)

existing, err := a.findTagKeyByShortName(ctx)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU race in acquire logic —

The acquire sequence is:

  1. CreateTagKey → ALREADY_EXISTS
  2. findTagKeyByShortName → list + search

Between steps 1 and 2, the resource could be deleted by another process,
causing findTagKeyByShortName to return nil and leaving the KCC object in a
permanently-errored state. Per standard Kubernetes controller patterns, this
should return a transient/retryable error rather than a fatal one, so the
controller runtime will re-queue.

if service == "" {
service = "cloudresourcemanager.googleapis.com"
}
return "//" + service + "/" + strings.Join(tokens, "/"), nil

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalizeParent behavioral change —

Before this PR, normalizeParent("projects/myproject") returned
"projects/123456" (bare path). After, it returns
"//cloudresourcemanager.googleapis.com/projects/123456". Existing TagBinding
objects in the mock stored with bare paths would fail the equality check in
ListTagBindings. The controller code appears to always pass //-prefixed
resource links so it may be fine in practice, but this should be documented
and the expected input format of normalizeParent should be clearly stated.

path := parent
if strings.HasPrefix(parent, "//") {
parts := strings.SplitN(parent[2:], "/", 2)
if len(parts) == 2 {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalizeParent malformed input silently falls through —
mockgcp/mockresourcemanager/tagbinding.go (Low)

if strings.HasPrefix(parent, "//") {
parts := strings.SplitN(parent[2:], "/", 2)
if len(parts) == 2 {
service = parts[0]
path = parts[1]
}
}

A malformed input like "//cloudresourcemanager.googleapis.com" (no path)
produces len(parts) == 1, so service stays "" and path stays as the original
value — leading to a silent wrong tokenization. The mock should at least log
or return an error for malformed input.

@afarbos

afarbos commented Mar 17, 2026

Copy link
Copy Markdown
Contributor

@codebot-robot could you address the comments?

@afarbos

afarbos commented Mar 18, 2026

Copy link
Copy Markdown
Contributor

@fkc1e100 Do you know how to address the comments here?

@cheftako cheftako left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 outstanding comments. I think best way to proceed is in a follow up.

@cheftako cheftako added this pull request to the merge queue Mar 18, 2026
@google-oss-prow google-oss-prow Bot added the lgtm label Mar 18, 2026
@google-oss-prow

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Merged via the queue into GoogleCloudPlatform:master with commit 05d0707 Mar 18, 2026
167 checks passed
@cheftako cheftako added this to the 1.148 milestone Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TagsTagKey and TagsTagValue controllers do not handle ALREADY_EXISTS in multi-cluster setups

3 participants