
kvtenant: properly release resources when reconnecting #106576

Draft · knz wants to merge 1 commit into cockroachdb:master from knz:20230711-tenant-close

Conversation

@knz knz (Contributor) commented Jul 11, 2023

Commit extracted from #106414.

Epic: CRDB-26691

Prior to this patch, the KV tenant connector could leak
`grpc.ClientConn` objects across RPC errors and re-dials. This is
because the KV tenant connector is more opinionated about "a working
RPC connection" than our `rpc` package and wants to choose to abandon
a RPC connection (and re-dial, possibly even a different server) *even
when the `rpc` package considers the connection healthy because it
responds to heartbeats*.

This patch improves the situation by more effectively closing
the low-level `grpc.ClientConn` when the connector decides that it
doesn't want to use it anymore.

Release note: None

@blathers-crl blathers-crl bot commented Jul 11, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member)

This change is Reviewable

@knz knz force-pushed the 20230711-tenant-close branch from 8633e20 to 503be25 on July 11, 2023 13:24
@knz knz requested review from aliher1911, jeffswenson and tbg July 11, 2023 13:24
@knz knz added the A-multitenancy Related to multi-tenancy label Jul 11, 2023
@tbg tbg (Member) commented Jul 11, 2023

I'm still unclear why this is the right thing to do. How sure are we that there aren't any callers of `tryForgetClient` that happen "frequently" in a healthy SQL pod? In that case, closing the connection will look like random connection problems that could be really hard to track down. Are we just "worried" (as we discussed on Slack1), without a concrete example to bolster it, that the address might be a load balancer and the connection is to the wrong cluster, but won't fail because the connection is unvalidated? Or is this PR fixing an actual observed issue (link?)?

My decision tree as a reviewer would be the following:

  • if this fixes a real bug, we can proceed since this is a simple fix, under the condition that we also log a warning when we close a connection (to avoid the "it looks like a weird network issue" case); I don't want to see this in KV L2.
  • if it doesn't, should we discuss moving off `GRPCUnvalidatedDial`, and perhaps bypass the `rpc.Context` entirely? After all, this is just a single connection that is being maintained (per pod); we can call `c.rpcContext.GRPCDialOptions()` instead and manage our own raw `ClientConn`, which we are then of course free to close. Or, we add `GRPCDialPodToNode`, which results in a connection that verifies the cluster name, or something like that. The latter option integrates with the existing metrics, etc., but is more complex, so I am leaning towards the former.

Footnotes

  1. https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1689066956978959 (internal)

@tbg tbg removed their request for review July 25, 2023 10:00