Logical Data Replication Does Not Support External Process Multitenancy #134857
Open
Labels: A-cross-cluster-replication, A-disaster-recovery, A-multitenancy, P-3, T-disaster-recovery, branch-master
Description
When running on external process tenants, LDR fails with an error that looks like:
I241111 14:48:19.864630 7239 jobs/registry.go:1599 [T10,Vcluster-10,nsql1] 768 REPLICATION STREAM PRODUCER job 1019956658274861057: stepping through state running
...
I241111 14:48:19.909567 7446 ccl/crosscluster/logical/logical_replication_job.go:611 [T10,Vcluster-10,nsql1,job=LOGICAL REPLICATION id=1019956658305499137] 820 hit retryable error subscription: ERROR: job with ID 1019956658274861057 does not exist (SQLSTATE XXUUU)
...
This error occurs because DistSQLPlanner.GetSQLInstanceInfo uses an API that returns information only about KV nodes.
At a high level, the sequence is:
- LDR connects to the remote tenant to get a plan for replication.
- The remote tenant maps spans to SQL instances, but returns the addresses of KV nodes instead of the addresses of the SQL servers.
- The LDR client attempts to dial the KV nodes. The KV cluster has no record of the LDR event producer job, so the subscription fails with "job ... does not exist" and the LDR job gets stuck in a retry loop.
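The failure mode in the steps above can be modeled with a small, self-contained sketch. Everything here is a hypothetical stand-in rather than CockroachDB code: `server`, `cluster`, `resolveViaNodeDescs`, and `resolveViaInstances` are invented names, and the addresses are made up. The only detail taken from the issue is the producer job ID from the log. The point is that a SQL instance ID and a KV node ID can share the same number while referring to different processes, so a lookup in the wrong table can succeed and still return an address that knows nothing about the job.

```go
package main

import "fmt"

// producerJobID is the producer job ID that appears in the log excerpt.
const producerJobID int64 = 1019956658274861057

// server is a hypothetical stand-in for a process that can be dialed.
type server struct {
	addr string
	jobs map[int64]bool // job IDs visible to this server
}

// subscribe fails when the dialed server does not know the producer job.
func (s *server) subscribe(jobID int64) error {
	if !s.jobs[jobID] {
		return fmt.Errorf("job with ID %d does not exist", jobID)
	}
	return nil
}

// cluster keeps KV nodes and SQL instances in separate ID spaces.
// The same integer can be a valid key in both maps.
type cluster struct {
	kvNodes      map[int]*server
	sqlInstances map[int]*server
}

// demoCluster builds one KV node and one SQL instance that share ID 1.
// Only the SQL instance knows about the producer job.
func demoCluster() *cluster {
	return &cluster{
		kvNodes: map[int]*server{
			1: {addr: "kv-1:26257", jobs: map[int64]bool{}},
		},
		sqlInstances: map[int]*server{
			1: {addr: "sql-1:26257", jobs: map[int64]bool{producerJobID: true}},
		},
	}
}

// resolveViaNodeDescs models the problematic lookup: the SQL instance ID
// is treated as a KV node ID.
func (c *cluster) resolveViaNodeDescs(instanceID int) (*server, error) {
	s, ok := c.kvNodes[instanceID]
	if !ok {
		return nil, fmt.Errorf("no KV node with ID %d", instanceID)
	}
	return s, nil
}

// resolveViaInstances models a lookup against the SQL instance table.
func (c *cluster) resolveViaInstances(instanceID int) (*server, error) {
	s, ok := c.sqlInstances[instanceID]
	if !ok {
		return nil, fmt.Errorf("no SQL instance with ID %d", instanceID)
	}
	return s, nil
}

func main() {
	c := demoCluster()
	resolvers := []func(int) (*server, error){c.resolveViaNodeDescs, c.resolveViaInstances}
	for _, resolve := range resolvers {
		s, err := resolve(1)
		if err != nil {
			fmt.Println("resolve failed:", err)
			continue
		}
		fmt.Printf("dialed %s: subscribe error = %v\n", s.addr, s.subscribe(producerJobID))
	}
}
```

In this model the first lookup returns no error at resolution time, which is why the problem only surfaces later, when the subscription is attempted.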
Here's a minimal diff that allows LDR to work in external process tenants.
diff --git a/pkg/sql/distsql_physical_planner.go b/pkg/sql/distsql_physical_planner.go
index 7f0ccf565c3..36d13477874 100644
--- a/pkg/sql/distsql_physical_planner.go
+++ b/pkg/sql/distsql_physical_planner.go
@@ -51,6 +51,7 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/sql/sqlerrors"
 	"github.com/cockroachdb/cockroach/pkg/sql/sqlinstance"
 	"github.com/cockroachdb/cockroach/pkg/sql/types"
+	"github.com/cockroachdb/cockroach/pkg/util"
 	"github.com/cockroachdb/cockroach/pkg/util/encoding"
 	"github.com/cockroachdb/cockroach/pkg/util/hlc"
 	"github.com/cockroachdb/cockroach/pkg/util/intsets"
@@ -251,7 +252,17 @@ func (dsp *DistSQLPlanner) GetAllInstancesByLocality(
 func (dsp *DistSQLPlanner) GetSQLInstanceInfo(
 	sqlInstanceID base.SQLInstanceID,
 ) (*roachpb.NodeDescriptor, error) {
-	return dsp.nodeDescs.GetNodeDescriptor(roachpb.NodeID(sqlInstanceID))
+	instance, err := dsp.sqlAddressResolver.GetInstance(context.Background(), sqlInstanceID)
+	if err != nil {
+		return nil, err
+	}
+	return &roachpb.NodeDescriptor{
+		SQLAddress: util.UnresolvedAddr{
+			NetworkField: "tcp",
+			AddressField: instance.InstanceSQLAddr,
+		},
+		Locality: instance.Locality,
+	}, nil
 }
Jira issue: CRDB-44283