Skip to content

sql: disable test tenant for tests using TestingDisableTableLeases#166108

Merged
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
mw5h:fix-descriptor-mutation-flake
Mar 25, 2026
Merged

sql: disable test tenant for tests using TestingDisableTableLeases#166108
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
mw5h:fix-descriptor-mutation-flake

Conversation

@mw5h
Copy link
Copy Markdown
Contributor

@mw5h mw5h commented Mar 18, 2026

Summary

Several tests directly manipulate table descriptors via KV writes and rely on
lease.TestingDisableTableLeases() to make those changes immediately visible
to the SQL layer. This process-global atomic bool only takes effect within the
test process; when an external process virtual cluster is metamorphically
injected, the external SQL server's lease manager still caches descriptors
normally and never sees the direct KV writes, causing intermittent failures.

This PR disables the default test tenant for all tests that combine
TestingDisableTableLeases with direct KV descriptor manipulation:

  • descriptor_mutation_test.go: 5 tests
  • index_mutation_test.go: 1 test
  • delete_preserving_index_test.go: 3 test sites
  • crdb_internal_test.go: 2 tests

Fixes: #165980

Release note: None

🤖 Generated with Claude Code

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Mar 18, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@mw5h mw5h marked this pull request as ready for review March 19, 2026 23:02
@mw5h mw5h requested review from a team and ZhouXing19 and removed request for a team March 19, 2026 23:02
@mw5h mw5h added backport Label PR's that are backports to older release branches backport-all Flags PRs that need to be backported to all supported release branches and removed backport Label PR's that are backports to older release branches labels Mar 19, 2026
Copy link
Copy Markdown
Collaborator

@ZhouXing19 ZhouXing19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Just one nit comment about the issue to link.
I'm also curious why this test failure happens recently -- seems that this is an old test, and the introduction of default test tenant / VC injection is not new either (since 2023, IIUC). Is it purely because the injection of a default tenant is metamorphic, or something changes recently? I asked glean but got nothing convincing.

@ZhouXing19 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on mw5h and yuzefovich).


pkg/sql/crdb_internal_test.go line 217 at r1 (raw file):

		// This is incompatible with external process virtual clusters where
		// the lease disable doesn't take effect.
		DefaultTestTenant: base.TestIsForStuffThatShouldWorkWithSecondaryTenantsButDoesntYet(165980),

nit: I'm not sure using the issue number of the original test failure is conventional here (maybe @yuzefovich knows better), but I think we should have it linked to a "root cause" issue, which is about "tests that mutate descriptors via KV + TestingDisableTableLeases flaking when an external-process virtual cluster is injected". Also, we might want to add this root issue to #76378?

Copy link
Copy Markdown
Contributor Author

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question on the "why now?" — I dug into it and the answer is straightforward.

Before October 2025, the entire pkg/sql test package had a package-wide override in main_test.go that disabled test tenant injection for all tests:

defer serverutils.TestingSetDefaultTenantSelectionOverride(
    base.TestIsForStuffThatShouldWorkWithSecondaryTenantsButDoesntYet(76378),
)()

In 3f64b0eb656 (Oct 22, 2025), @yuzefovich removed that package-wide override and enabled test tenants for pkg/sql, individually fixing tests he identified as incompatible. These tests weren't caught in that pass. Since tenant injection is metamorphic, they only fail when the random coin flip injects an external-process tenant, which is why it took months to surface as flaky failures.

Re: the issue number — good point, I'll file a root cause issue for "tests using TestingDisableTableLeases + direct KV descriptor manipulation are incompatible with external-process virtual clusters" and link it to #76378, then update the references in this PR.

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! :lgtm:

@yuzefovich reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on mw5h and ZhouXing19).


pkg/sql/crdb_internal_test.go line 217 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

nit: I'm not sure using the issue number of the original test failure is conventional here (maybe @yuzefovich knows better), but I think we should have it linked to a "root cause" issue, which is about "tests that mutate descriptors via KV + TestingDisableTableLeases flaking when an external-process virtual cluster is injected". Also, we might want to add this root issue to #76378?

Yeah, this is the way - filing a separate issue to this particular problem and then link this new issue in the meta issue #76378.


pkg/sql/crdb_internal_test.go line 439 at r2 (raw file):

		// This is incompatible with external process virtual clusters where
		// the lease disable doesn't take effect.
		DefaultTestTenant: base.TestIsForStuffThatShouldWorkWithSecondaryTenantsButDoesntYet(166311),

nit: I think in the shared process these tests should work, so it might be worth creating TestDoesNotWorkWithExternalProcessModeButWeDontKnowWhyYet helper and using it this PR.

Several tests directly manipulate table descriptors via KV writes and
rely on `lease.TestingDisableTableLeases()` to make those changes
immediately visible to the SQL layer. This process-global atomic bool
only takes effect within the test process; when an external process
virtual cluster is metamorphically injected, the external SQL server's
lease manager still caches descriptors normally and never sees the
direct KV writes. This causes intermittent failures where the SQL layer
uses stale descriptors.

The reported failure in TestOperationsWithIndexMutation manifests as a
missing row after UPSERT: the previous subtest's direct KV descriptor
writes (8+ version bumps) leave stale lease state in the external
process, and after the table is dropped and recreated, the first UPSERT
can silently target stale state.

Fix this by setting `DefaultTestTenant` to disable external process
virtual cluster injection in all tests that combine
`TestingDisableTableLeases` with direct KV descriptor manipulation.
Since shared-process mode works fine (same process, same atomic bool),
we use `TestDoesNotWorkWithExternalProcessMode` rather than disabling
all secondary tenant modes.

These tests were exposed when the package-wide tenant injection override
was removed in 3f64b0e (Oct 2025). Since injection is metamorphic,
failures are intermittent.

Fixes: cockroachdb#165980
Informs: cockroachdb#166311

Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mw5h mw5h force-pushed the fix-descriptor-mutation-flake branch from 7d8815d to e268efd Compare March 23, 2026 23:46
Copy link
Copy Markdown
Contributor Author

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mw5h made 1 comment and resolved 1 discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on yuzefovich).


pkg/sql/crdb_internal_test.go line 439 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think in the shared process these tests should work, so it might be worth creating TestDoesNotWorkWithExternalProcessModeButWeDontKnowWhyYet helper and using it this PR.

Turns out that this already exists!

@mw5h mw5h added backport-26.1.x Flags PRs that need to be backported to 26.1 backport-26.2.x Flags PRs that need to be backported to 26.2 and removed backport-all Flags PRs that need to be backported to all supported release branches labels Mar 25, 2026
@trunk-io trunk-io bot merged commit 6fea251 into cockroachdb:master Mar 25, 2026
28 checks passed
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 25, 2026

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #165980: branch-release-26.1, branch-release-26.2.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-26.1.x Flags PRs that need to be backported to 26.1 backport-26.2.x Flags PRs that need to be backported to 26.2 target-release-26.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: TestOperationsWithIndexMutation failed

4 participants