Logical Data Replication Does Not Support External Process Multitenancy #134857

@jeffswenson

When running on external process tenants, LDR fails with an error that looks like:

I241111 14:48:19.864630 7239 jobs/registry.go:1599  [T10,Vcluster-10,nsql1] 768  REPLICATION STREAM PRODUCER job 1019956658274861057: stepping through state running
...
I241111 14:48:19.909567 7446 ccl/crosscluster/logical/logical_replication_job.go:611  [T10,Vcluster-10,nsql1,job=LOGICAL REPLICATION id=1019956658305499137] 820  hit retryable error subscription: ERROR: job with ID 1019956658274861057 does not exist (SQLSTATE XXUUU)
...

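This error occurs because DistSQLPlanner.GetSQLInstanceInfo uses an API that only returns information about KV nodes. In an external process tenant the SQL servers run as separate processes from the KV nodes, so the returned address points at the wrong server.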

At a high level, this is what happens:

  1. LDR connects to the remote tenant to get a plan for replication.
  2. The remote tenant maps spans to SQL instances, but returns the addresses of KV nodes instead of the SQL servers.
  3. The LDR client attempts to dial the KV nodes. The KV cluster does not have the LDR event producer job, so the subscription fails with a retryable error and the LDR job gets stuck in a retry loop.
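
The three steps can be reproduced in a small toy model. This is only a sketch: `server`, `subscribe`, `runConsumer`, the addresses, and the job ID are all hypothetical names invented for illustration, not CockroachDB code. It assumes the consumer retries a failed subscription a bounded number of times (the real job retries on its own schedule).

```go
package main

import (
	"errors"
	"fmt"
)

// server is a hypothetical dialable process that may host producer jobs.
type server struct {
	name string
	jobs map[int64]bool // IDs of producer jobs known to this server
}

var errJobNotFound = errors.New("job does not exist")

// subscribe mimics the consumer asking a dialed server for a producer
// job's event stream. It fails if the server does not know the job.
func (s *server) subscribe(jobID int64) error {
	if !s.jobs[jobID] {
		return fmt.Errorf("subscription to %s: job with ID %d: %w",
			s.name, jobID, errJobNotFound)
	}
	return nil
}

// runConsumer dials addr and retries the subscription up to maxAttempts
// times. It returns the number of attempts made and the last error.
func runConsumer(
	network map[string]*server, addr string, jobID int64, maxAttempts int,
) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		srv, ok := network[addr]
		if !ok {
			return attempt, fmt.Errorf("dial %s: no such server", addr)
		}
		if err = srv.subscribe(jobID); err == nil {
			return attempt, nil
		}
		// The address never changes between attempts, so every retry
		// against the wrong server fails the same way.
	}
	return maxAttempts, err
}

func main() {
	const producerJob = 42 // hypothetical job ID
	network := map[string]*server{
		// The KV node has no tenant jobs; the SQL server owns the producer job.
		"kv-1:26257":  {name: "kv node 1", jobs: map[int64]bool{}},
		"sql-1:26257": {name: "sql instance 1", jobs: map[int64]bool{producerJob: true}},
	}

	// Step 2 went wrong: the plan handed back the KV node's address.
	attempts, err := runConsumer(network, "kv-1:26257", producerJob, 3)
	fmt.Println("kv address: ", attempts, err)

	// With the SQL server's address the first attempt succeeds.
	attempts, err = runConsumer(network, "sql-1:26257", producerJob, 3)
	fmt.Println("sql address:", attempts, err)
}
```

Because the planner keeps returning the same KV address, retrying cannot recover: the error is retryable in form but permanent in practice.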

Here's a minimal diff that allows LDR to work in external process tenants.

diff --git a/pkg/sql/distsql_physical_planner.go b/pkg/sql/distsql_physical_planner.go
index 7f0ccf565c3..36d13477874 100644
--- a/pkg/sql/distsql_physical_planner.go
+++ b/pkg/sql/distsql_physical_planner.go
@@ -51,6 +51,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/sql/sqlerrors"
        "github.com/cockroachdb/cockroach/pkg/sql/sqlinstance"
        "github.com/cockroachdb/cockroach/pkg/sql/types"
+       "github.com/cockroachdb/cockroach/pkg/util"
        "github.com/cockroachdb/cockroach/pkg/util/encoding"
        "github.com/cockroachdb/cockroach/pkg/util/hlc"
        "github.com/cockroachdb/cockroach/pkg/util/intsets"
@@ -251,7 +252,17 @@ func (dsp *DistSQLPlanner) GetAllInstancesByLocality(
 func (dsp *DistSQLPlanner) GetSQLInstanceInfo(
        sqlInstanceID base.SQLInstanceID,
 ) (*roachpb.NodeDescriptor, error) {
-       return dsp.nodeDescs.GetNodeDescriptor(roachpb.NodeID(sqlInstanceID))
+       instance, err := dsp.sqlAddressResolver.GetInstance(context.Background(), sqlInstanceID)
+       if err != nil {
+               return nil, err
+       }
+       return &roachpb.NodeDescriptor{
+               SQLAddress: util.UnresolvedAddr{
+                       NetworkField: "tcp",
+                       AddressField: instance.InstanceSQLAddr,
+               },
+               Locality: instance.Locality,
+       }, nil
 }
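
To make the before/after behavior of the diff concrete, here is a self-contained sketch with hypothetical stand-ins (`descriptor`, `planner`, `lookupViaNodeID`, `lookupViaInstance`) for the two data sources. It assumes SQL instance IDs and KV node IDs share a numeric space, as the `roachpb.NodeID(sqlInstanceID)` cast in the removed line suggests, so an instance ID can silently match an unrelated KV node.

```go
package main

import "fmt"

// descriptor is a hypothetical, trimmed-down stand-in for a node descriptor.
type descriptor struct {
	SQLAddress string
	Locality   string
}

// planner holds two hypothetical lookup tables.
type planner struct {
	kvNodes      map[int32]descriptor // keyed by KV node ID
	sqlInstances map[int32]descriptor // keyed by SQL instance ID
}

// lookupViaNodeID models the removed line: the SQL instance ID is
// reinterpreted as a KV node ID and looked up among KV nodes.
func (p planner) lookupViaNodeID(instanceID int32) (descriptor, error) {
	d, ok := p.kvNodes[instanceID]
	if !ok {
		return descriptor{}, fmt.Errorf("node %d not found", instanceID)
	}
	return d, nil
}

// lookupViaInstance models the added lines: the ID is resolved against
// the SQL instance table, and an unknown ID is reported as an error.
func (p planner) lookupViaInstance(instanceID int32) (descriptor, error) {
	d, ok := p.sqlInstances[instanceID]
	if !ok {
		return descriptor{}, fmt.Errorf("sql instance %d not found", instanceID)
	}
	return d, nil
}

func main() {
	p := planner{
		kvNodes: map[int32]descriptor{
			1: {SQLAddress: "kv-1:26257", Locality: "region=a"},
		},
		sqlInstances: map[int32]descriptor{
			1: {SQLAddress: "sql-1:26257", Locality: "region=a"},
		},
	}

	before, _ := p.lookupViaNodeID(1)
	after, _ := p.lookupViaInstance(1)
	fmt.Println("before:", before.SQLAddress)
	fmt.Println("after: ", after.SQLAddress)
}
```

In this model the old lookup does not fail; it succeeds with the wrong address, which is why the problem only surfaces later, when the consumer dials that address.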

Jira issue: CRDB-44283
