Yandex Cloud providers broken in dual-stack environments #3837

@stek29

Describe the bug
Yandex Cloud providers are broken in some dual-stack environments due to grpc-go not fully supporting dual-stack backends.

To Reproduce

  1. Run ESO in a dual-stack environment with ULA IPv6 addresses, so IPv4 will be preferred
  2. Do not provide IPv4 connectivity to Yandex Cloud API – by having wrong routes or not allowing this traffic in network policies and other firewalls
  3. Create a Secret Store
  4. Notice Secret Store being ready
  5. Create an External Secret

Expected behavior
External Secret is successfully synced via IPv6

Screenshots
External Secret won't be synced due to timeouts via IPv4:

Warning  UpdateFailed  3m31s (x17 over 9m3s)  external-secrets  error retrieving secret at .data[0], key: KEYID, err: unable to request secret payload to get secret: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 84.201.168.170:443: i/o timeout"

Here 84.201.168.170 is the IPv4 address of the lockbox-payload API endpoint:

❯ grpcurl -d '{"api_endpoint_id": "lockbox-payload"}' api.cloud.yandex.net:443 yandex.cloud.endpoint.ApiEndpointService/Get
{
  "id": "lockbox-payload",
  "address": "payload.lockbox.api.cloud.yandex.net:443"
}

❯ host payload.lockbox.api.cloud.yandex.net
payload.lockbox.api.cloud.yandex.net is an alias for public-dpl.lockbox.cloud.yandex.net.
public-dpl.lockbox.cloud.yandex.net has address 84.201.168.170
public-dpl.lockbox.cloud.yandex.net has IPv6 address 2a0d:d6c1:0:1c::1c6

Additional context
Yandex Cloud API is based on gRPC, which doesn't explicitly support dual-stack backends as of writing. There's a Proposal A61: IPv4 and IPv6 Dualstack Backend Support for it, which is not implemented in grpc-go yet – they're working on it right now: grpc/grpc-go#7498

However, notice that the SecretStore is initialised successfully and the lockbox/certificate manager endpoints are discovered successfully, yet API calls to those endpoints fail during external secret sync.
This happens because ycsdk is used for the API calls during Secret Store initialisation and auth, while a gRPC client is used directly for the calls that fetch the external secrets.

ycsdk uses the deprecated DialContext call to create its gRPC client:
https://github.com/yandex-cloud/go-sdk/blob/1018f7c96dc7bc49822d5fd96be72e8506ed0533/pkg/grpcclient/conn_context.go#L97

This call implicitly sets the gRPC resolver to passthrough instead of the default dns:
https://github.com/grpc/grpc-go/blob/005b092ca3c279e352f1247c4316b0351dec3a56/clientconn.go#L218-L222

The gRPC client in external-secrets, on the other hand, is created via the non-deprecated NewClient call:

return grpc.NewClient(serviceAPIEndpoint.Address,
	grpc.WithTransportCredentials(credentials.NewTLS(tlsConfig)),
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                time.Second * 30,
		Timeout:             time.Second * 10,
		PermitWithoutStream: false,
	}),
	grpc.WithUserAgent("external-secrets"),
)

Short description of passthrough and dns resolvers:

  • passthrough resolver just passes addresses to load-balancer, it doesn't resolve them at all
  • dns resolver tries to discover configuration and endpoints via TXT grpclb and SRV _grpc_config._tcp DNS records, and falls back to discovering IP addresses via A/AAAA records. The order of A/AAAA records returned is determined by Go's standard library LookupHost, which does RFC6724 sorting of returned addresses.
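To make the difference between the two entry points concrete, here is a minimal sketch of how the same address ends up on different resolvers. The helper `effectiveScheme` is hypothetical and deliberately simplified (it is not grpc-go's actual target parsing); it only models the rule described above: an explicit scheme wins, otherwise the default of the calling API applies.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveScheme is a simplified, hypothetical model of how a gRPC target
// string selects a resolver: an explicit known scheme wins, otherwise the
// default of the calling API applies ("passthrough" for Dial/DialContext,
// "dns" for NewClient).
func effectiveScheme(target, defaultScheme string) string {
	for _, scheme := range []string{"dns", "passthrough", "unix"} {
		if strings.HasPrefix(target, scheme+":") {
			return scheme
		}
	}
	return defaultScheme
}

func main() {
	const addr = "payload.lockbox.api.cloud.yandex.net:443"

	// Same address, different default depending on the API used.
	fmt.Println(effectiveScheme(addr, "passthrough")) // as with DialContext
	fmt.Println(effectiveScheme(addr, "dns"))         // as with NewClient

	// An explicit scheme overrides the default.
	fmt.Println(effectiveScheme("passthrough:///"+addr, "dns"))
}
```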

These addresses and the configuration (if discovered) are passed to the balancer, which is pick_first by default. pick_first should try all addresses serially in the order given by the resolver, but each attempt uses the deadline of the whole Dial, so all of the time can be spent trying one address family. That is not ideal and goes against best practices like Happy Eyeballs:
https://github.com/grpc/grpc-go/blob/2da976983bbb33feb3e25b7daaa8f60b9769adb5/clientconn.go#L1254-L1260
https://github.com/grpc/grpc-go/blob/2da976983bbb33feb3e25b7daaa8f60b9769adb5/clientconn.go#L1329-L1331

RFC6724 sorting prefers IPv4-to-IPv4 over ULA-to-GUA IPv6, so on a dual-stack client with a ULA IPv6 address IPv4 will be preferred. In fact only IPv4 will be tried, because pick_first effectively tries only the first address.
Notice that it works the other way too: if IPv6 has a GUA but is silently broken, IPv4 will never be tried and the connection won't be established.

ULA IPv6 addresses being given to pods instead of GUA is quite common in managed k8s.

You can test these methods with grpcurl. Here I'm running them on macOS with dual-stack networking and a GUA IPv6 address, explicitly setting the Go resolver to native instead of CGO:

verbose logs for passthrough resolver
❯ GODEBUG=netdns=go+2 GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info grpcurl -d '{"api_endpoint_id": "lockbox-payload"}' passthrough:///api.cloud.yandex.net:443 yandex.cloud.endpoint.ApiEndpointService/Get
2024/08/27 10:06:05 INFO: [core] [Channel #1] Channel created
2024/08/27 10:06:05 INFO: [core] [Channel #1] original dial target is: "passthrough:///api.cloud.yandex.net:443"
2024/08/27 10:06:05 INFO: [core] [Channel #1] parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/api.cloud.yandex.net:443", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2024/08/27 10:06:05 INFO: [core] [Channel #1] Channel authority set to "api.cloud.yandex.net:443"
2024/08/27 10:06:05 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "api.cloud.yandex.net:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "api.cloud.yandex.net:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2024/08/27 10:06:05 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2024/08/27 10:06:05 INFO: [core] [pick-first-lb 0x14000335890] Received new config {
  "shuffleAddressList": false
}, resolver state {
  "Addresses": [
    {
      "Addr": "api.cloud.yandex.net:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "api.cloud.yandex.net:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
}
2024/08/27 10:06:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel created
2024/08/27 10:06:05 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/08/27 10:06:05 INFO: [core] [Channel #1] Channel exiting idle mode
2024/08/27 10:06:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/08/27 10:06:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "api.cloud.yandex.net:443" to connect
2024/08/27 10:06:05 INFO: [core] [pick-first-lb 0x14000335890] Received SubConn state update: 0x14000335a10, {ConnectivityState:CONNECTING ConnectionError:<nil>}
go package net: confVal.netCgo = false  netGo = true
go package net: GODEBUG setting forcing use of Go's resolver
go package net: hostLookupOrder(api.cloud.yandex.net) = files,dns
2024/08/27 10:06:06 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY
2024/08/27 10:06:06 INFO: [core] [pick-first-lb 0x14000335890] Received SubConn state update: 0x14000335a10, {ConnectivityState:READY ConnectionError:<nil>}
2024/08/27 10:06:06 INFO: [core] [Channel #1] Channel Connectivity change to READY
{
  "id": "lockbox-payload",
  "address": "payload.lockbox.api.cloud.yandex.net:443"
}
2024/08/27 10:06:06 INFO: [core] [Channel #1] Channel Connectivity change to SHUTDOWN
2024/08/27 10:06:06 INFO: [core] [Channel #1] Closing the name resolver
2024/08/27 10:06:06 INFO: [core] [Channel #1] ccBalancerWrapper: closing
2024/08/27 10:06:06 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/08/27 10:06:06 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/08/27 10:06:06 INFO: [transport] [client-transport 0x140003bc008] Closing: rpc error: code = Canceled desc = grpc: the client connection is closing
2024/08/27 10:06:06 INFO: [transport] [client-transport 0x140003bc008] loopyWriter exiting with error: transport closed by client
2024/08/27 10:06:06 INFO: [core] [Channel #1] Channel deleted

verbose log for dns resolver with GUA
❯ GODEBUG=netdns=go+2 GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info grpcurl -d '{"api_endpoint_id": "lockbox-payload"}' dns:///api.cloud.yandex.net:443 yandex.cloud.endpoint.ApiEndpointService/Get
2024/08/27 10:06:41 INFO: [core] [Channel #1] Channel created
2024/08/27 10:06:41 INFO: [core] [Channel #1] original dial target is: "dns:///api.cloud.yandex.net:443"
2024/08/27 10:06:41 INFO: [core] [Channel #1] parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/api.cloud.yandex.net:443", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2024/08/27 10:06:41 INFO: [core] [Channel #1] Channel authority set to "api.cloud.yandex.net:443"
2024/08/27 10:06:41 INFO: [core] [Channel #1] Channel exiting idle mode
go package net: confVal.netCgo = false  netGo = true
go package net: GODEBUG setting forcing use of Go's resolver
go package net: hostLookupOrder(api.cloud.yandex.net) = files,dns
2024/08/27 10:06:43 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "[2a0d:d6c1:0:1c::4e]:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    },
    {
      "Addr": "84.201.181.26:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "[2a0d:d6c1:0:1c::4e]:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    },
    {
      "Addresses": [
        {
          "Addr": "84.201.181.26:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2024/08/27 10:06:43 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2024/08/27 10:06:43 INFO: [core] [pick-first-lb 0x140000caa80] Received new config {
  "shuffleAddressList": false
}, resolver state {
  "Addresses": [
    {
      "Addr": "[2a0d:d6c1:0:1c::4e]:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    },
    {
      "Addr": "84.201.181.26:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "[2a0d:d6c1:0:1c::4e]:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    },
    {
      "Addresses": [
        {
          "Addr": "84.201.181.26:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
}
2024/08/27 10:06:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel created
2024/08/27 10:06:43 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/08/27 10:06:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/08/27 10:06:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "[2a0d:d6c1:0:1c::4e]:443" to connect
2024/08/27 10:06:43 INFO: [core] [pick-first-lb 0x140000caa80] Received SubConn state update: 0x140000cac00, {ConnectivityState:CONNECTING ConnectionError:<nil>}
2024/08/27 10:06:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY
2024/08/27 10:06:44 INFO: [core] [pick-first-lb 0x140000caa80] Received SubConn state update: 0x140000cac00, {ConnectivityState:READY ConnectionError:<nil>}
2024/08/27 10:06:44 INFO: [core] [Channel #1] Channel Connectivity change to READY
{
  "id": "lockbox-payload",
  "address": "payload.lockbox.api.cloud.yandex.net:443"
}
2024/08/27 10:06:47 INFO: [core] [Channel #1] Channel Connectivity change to SHUTDOWN
2024/08/27 10:06:47 INFO: [core] [Channel #1] Closing the name resolver
2024/08/27 10:06:47 INFO: [core] [Channel #1] ccBalancerWrapper: closing
2024/08/27 10:06:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/08/27 10:06:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/08/27 10:06:47 INFO: [transport] [client-transport 0x140001f1b08] Closing: rpc error: code = Canceled desc = grpc: the client connection is closing
2024/08/27 10:06:47 INFO: [transport] [client-transport 0x140001f1b08] loopyWriter exiting with error: transport closed by client
2024/08/27 10:06:47 INFO: [core] [Channel #1] Channel deleted

And here I'm running the dns resolver on macOS with non-GUA IPv6 addresses. Notice how the IPv4/IPv6 order changes:

dns resolver with non-GUA IPv6 address
❯ GODEBUG=netdns=go+2 GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info grpcurl -d '{"api_endpoint_id": "lockbox-payload"}' dns:///api.cloud.yandex.net:443 yandex.cloud.endpoint.ApiEndpointService/Get
2024/08/27 10:07:29 INFO: [core] [Channel #1] Channel created
2024/08/27 10:07:29 INFO: [core] [Channel #1] original dial target is: "dns:///api.cloud.yandex.net:443"
2024/08/27 10:07:29 INFO: [core] [Channel #1] parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/api.cloud.yandex.net:443", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2024/08/27 10:07:29 INFO: [core] [Channel #1] Channel authority set to "api.cloud.yandex.net:443"
2024/08/27 10:07:29 INFO: [core] [Channel #1] Channel exiting idle mode
go package net: confVal.netCgo = false  netGo = true
go package net: GODEBUG setting forcing use of Go's resolver
go package net: hostLookupOrder(api.cloud.yandex.net) = files,dns
2024/08/27 10:07:30 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "84.201.181.26:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    },
    {
      "Addr": "[2a0d:d6c1:0:1c::4e]:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "84.201.181.26:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    },
    {
      "Addresses": [
        {
          "Addr": "[2a0d:d6c1:0:1c::4e]:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2024/08/27 10:07:30 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2024/08/27 10:07:30 INFO: [core] [pick-first-lb 0x140001111a0] Received new config {
  "shuffleAddressList": false
}, resolver state {
  "Addresses": [
    {
      "Addr": "84.201.181.26:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    },
    {
      "Addr": "[2a0d:d6c1:0:1c::4e]:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "84.201.181.26:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    },
    {
      "Addresses": [
        {
          "Addr": "[2a0d:d6c1:0:1c::4e]:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
}
2024/08/27 10:07:30 INFO: [core] [Channel #1 SubChannel #2] Subchannel created
2024/08/27 10:07:30 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/08/27 10:07:30 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/08/27 10:07:30 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "84.201.181.26:443" to connect
2024/08/27 10:07:30 INFO: [core] [pick-first-lb 0x140001111a0] Received SubConn state update: 0x14000111320, {ConnectivityState:CONNECTING ConnectionError:<nil>}
2024/08/27 10:07:30 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY
2024/08/27 10:07:30 INFO: [core] [pick-first-lb 0x140001111a0] Received SubConn state update: 0x14000111320, {ConnectivityState:READY ConnectionError:<nil>}
2024/08/27 10:07:30 INFO: [core] [Channel #1] Channel Connectivity change to READY
{
  "id": "lockbox-payload",
  "address": "payload.lockbox.api.cloud.yandex.net:443"
}
2024/08/27 10:07:31 INFO: [core] [Channel #1] Channel Connectivity change to SHUTDOWN
2024/08/27 10:07:31 INFO: [core] [Channel #1] Closing the name resolver
2024/08/27 10:07:31 INFO: [core] [Channel #1] ccBalancerWrapper: closing
2024/08/27 10:07:31 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/08/27 10:07:31 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/08/27 10:07:31 INFO: [transport] [client-transport 0x14000193b08] Closing: rpc error: code = Canceled desc = grpc: the client connection is closing
2024/08/27 10:07:31 INFO: [transport] [client-transport 0x14000193b08] loopyWriter exiting with error: transport closed by client
2024/08/27 10:07:31 INFO: [core] [Channel #1] Channel deleted

Possible solutions

  • Set the resolver to passthrough explicitly, while keeping the current code and architecture of the Yandex Cloud secret stores. This is provided in #3838 ("fix: set grpc resolver explicitly in yandex")
  • Do not use grpc directly in any way and only use ycsdk calls and methods
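
As a rough illustration of the first option, the target passed to grpc.NewClient can carry the resolver scheme explicitly. The helper `passthroughTarget` below is hypothetical and is not necessarily how the linked fix is written; it only shows the string that would replace serviceAPIEndpoint.Address as the first argument.

```go
package main

import (
	"fmt"
	"strings"
)

// passthroughTarget pins a plain host:port address to the passthrough
// resolver. Targets that already contain "://" are assumed to carry a
// scheme and are returned unchanged (a simplification for this sketch).
func passthroughTarget(addr string) string {
	if strings.Contains(addr, "://") {
		return addr
	}
	return "passthrough:///" + addr
}

func main() {
	target := passthroughTarget("payload.lockbox.api.cloud.yandex.net:443")
	fmt.Println(target)
	// The result would then be used as the first argument of grpc.NewClient,
	// with the dial options from the snippet above left as they are.
}
```

With passthrough, the hostname reaches Go's dialer unresolved, so address selection and fallback between families are left to the net package rather than to pick_first.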
