Skip to content

use EndpointBuilder in CDS#45733

Merged
istio-testing merged 15 commits intoistio:masterfrom
stevenctl:ep-builder-cds
Aug 31, 2023
Merged

use EndpointBuilder in CDS#45733
istio-testing merged 15 commits intoistio:masterfrom
stevenctl:ep-builder-cds

Conversation

@stevenctl
Copy link
Copy Markdown
Contributor

@stevenctl stevenctl commented Jun 29, 2023

Changes:

  • Use EndpointBuilder in favor of buildLocalityLbEndpoints in cluster builder. That cluster builder method is a partial duplication of a lot of the same logic we have for EDS.
  • Enable multi-network at CDS level

Benchmarks are no longer affected.

                                              │ /home/landow/old.txt │        /home/landow/new.txt        │
                                                │        sec/op        │   sec/op     vs base               │
ClusterGeneration/gateways-24                              4.844m ± 2%   4.875m ± 4%       ~ (p=0.529 n=10)
ClusterGeneration/gateways-shared-24                       4.864m ± 4%   4.873m ± 3%       ~ (p=0.796 n=10)
ClusterGeneration/knative-gateway-24                       12.55µ ± 2%   12.71µ ± 4%       ~ (p=0.393 n=10)
ClusterGeneration/http-24                                  868.8µ ± 4%   877.8µ ± 4%       ~ (p=0.631 n=10)
ClusterGeneration/tcp-24                                   867.0µ ± 4%   874.0µ ± 3%       ~ (p=0.579 n=10)
ClusterGeneration/tls-24                                   881.4µ ± 4%   884.2µ ± 4%       ~ (p=0.912 n=10)
ClusterGeneration/auto-24                                 1021.5µ ± 5%   994.1µ ± 7%       ~ (p=0.247 n=10)
ClusterGeneration/telemetry-api-24                         2.420m ± 5%   2.461m ± 3%       ~ (p=0.436 n=10)
ClusterGeneration/virtualservice-24                        565.5µ ± 3%   573.3µ ± 4%       ~ (p=0.143 n=10)
ClusterGeneration/serviceentry-workloadentry-24            3.318m ± 5%   3.238m ± 6%       ~ (p=0.529 n=10)
geomean                                                    990.7µ        992.7µ       +0.21%

                                                │ /home/landow/old.txt │        /home/landow/new.txt         │
                                                │        kb/msg        │   kb/msg    vs base                 │
ClusterGeneration/gateways-24                               390.3 ± 0%   390.3 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/gateways-shared-24                        390.3 ± 0%   390.3 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/knative-gateway-24                        1.058 ± 0%   1.058 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/http-24                                   86.34 ± 0%   86.34 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/tcp-24                                    86.95 ± 0%   86.95 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/tls-24                                    86.95 ± 0%   86.95 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/auto-24                                   101.8 ± 0%   101.8 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/telemetry-api-24                          172.9 ± 0%   172.9 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/virtualservice-24                         39.70 ± 0%   39.70 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/serviceentry-workloadentry-24             379.1 ± 0%   379.1 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                     88.01        88.01       +0.00%
¹ all samples are equal

                                                │ /home/landow/old.txt │          /home/landow/new.txt          │
                                                │    resources/msg     │ resources/msg  vs base                 │
ClusterGeneration/gateways-24                              1.003k ± 0%     1.003k ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/gateways-shared-24                       1.003k ± 0%     1.003k ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/knative-gateway-24                        3.000 ± 0%      3.000 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/http-24                                   105.0 ± 0%      105.0 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/tcp-24                                    105.0 ± 0%      105.0 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/tls-24                                    105.0 ± 0%      105.0 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/auto-24                                   105.0 ± 0%      105.0 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/telemetry-api-24                          411.0 ± 0%      411.0 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/virtualservice-24                         105.0 ± 0%      105.0 ± 0%       ~ (p=1.000 n=10) ¹
ClusterGeneration/serviceentry-workloadentry-24             411.0 ± 0%      411.0 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                     151.8           151.8       +0.00%
¹ all samples are equal



cc @kdorosh @howardjohn

@stevenctl stevenctl requested review from a team as code owners June 29, 2023 16:02
@istio-testing istio-testing added do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 29, 2023
@stevenctl
Copy link
Copy Markdown
Contributor Author

Next, I'd like to consolidate on shards vs service instances but that has implications for cds cache.

Copy link
Copy Markdown
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't read the code yet, but initial reaction: there is a lot of effort to handle the special case of EDS that it doesn't use snapshoting mechanism (push context) and allows state to change at any time. Spreading that property made be tricky

@stevenctl
Copy link
Copy Markdown
Contributor Author

/test benchmark_istio

Comment on lines 484 to 428
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally agree that removing this code duplication is probably worth it, unless spreading the mutability of EDS that @howardjohn was talking about is both too difficult and completely necessary. because off the cuff id rather remove the mutability property for EDS and use snapshotting or something else since that's easier to reason about.

why is EDS handled differently? performance reasons? EDS being more dynamic?

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Jul 13, 2023
@istio-testing istio-testing added needs-rebase Indicates a PR needs to be rebased before being merged and removed needs-rebase Indicates a PR needs to be rebased before being merged labels Jul 14, 2023
@stevenctl stevenctl linked an issue Jul 18, 2023 that may be closed by this pull request
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Jul 19, 2023
@istio-testing istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 19, 2023
@stevenctl stevenctl force-pushed the ep-builder-cds branch 3 times, most recently from b5a9bba to cdc8548 Compare July 19, 2023 15:07
@hzxuzhonghu
Copy link
Copy Markdown
Member

I feel a little concerned about such big refactor, since it touches too many files. With CI, it has failed multi times, so need much more look

@istio-testing istio-testing added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 20, 2023
@stevenctl
Copy link
Copy Markdown
Contributor Author

With CI, it has failed multi times, so need much more look

FWIW, it wasn't really done yet. The failures from earlier were all due to make format breaking some import statements making tests not compile.

@stevenctl
Copy link
Copy Markdown
Contributor Author

/test all

@stevenctl stevenctl marked this pull request as ready for review August 22, 2023 20:23
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Aug 22, 2023
return locEps
}

func (b *EndpointBuilder) filterIstioEndpoint(ep *model.IstioEndpoint, svcPort *model.Port) bool {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be looked at closely. Must cover what we have in the original CDS code.

}
h.Write(Separator)

t.endpointBuilder.WriteHash(h)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to add more factor for cache key calculation when we just do a cleanup

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is mostly that EndpointBuilder takes a model.Proxy and PushContext. It can decide to use whatever parts of that it needs. The cache key for EndpointBuilder will be updated to include those, but now that affects CDS.
I want to make it hard(er) to write caching bugs.

Right now endpoint builder has these things that CDS dosn't:

  • isClusterLocal
  • node type
  • failover priority labels

we just do a cleanup

I'm not sure what you mean.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is mostly that EndpointBuilder takes a model.Proxy and PushContext. It can decide to use whatever parts of that it needs. The cache key for EndpointBuilder will be updated to include those, but now that affects CDS.
I want to make it hard(er) to write caching bugs.

Right now endpoint builder has these things that CDS dosn't:

  • isClusterLocal
  • node type
  • failover priority labels

we just do a cleanup

I'm not sure what you mean.

b.subsetName = strings.TrimPrefix(b.subsetName, "http/")
b.subsetName = strings.TrimPrefix(b.subsetName, "tcp/")
}
b.mtlsChecker = newMtlsChecker(b.push, b.port, b.destinationRule.GetRule(), b.subsetName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: donot couple mtlsChecker with subsetInfo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mtlsChecker depends on the subset name. I could make it so we build it in the constructor, and the build it again if the subset changes?

Comment on lines +477 to +479
// If we don't know the address we must eventually use a gateway address
if ep.Address == "" && ep.Network == b.network {
return false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we supported this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already in endpoint_builder

Comment on lines +381 to +382
if noCache {
ep.ComputeEnvoyEndpoint(eep)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confused why set it for cds endpoint, previously it is for eds

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be !noCache.

}

// Apply the Split Horizon EDS filter, if applicable.
locEps = b.EndpointsByNetworkFilter(locEps)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest not apply to cds endpoint, just keep as before. Not quite sure whether we set the network label correct

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean. The original motivation (for me) was to get multi-network working for non EDS. It's possible I have a ServiceEntry that has mixed IPs/Hostnames so it must be a DNS cluster. I want the IPs to get mapped.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean for cds, we donot apply filter now. This pr is already big, if we want to apply network filter for cds endpoint, we can do it later

if shardKey.Cluster != b.clusterID {
// If the downstream service is configured as cluster-local, only include endpoints that
// reside in the same cluster.
if isClusterLocal || b.service.Attributes.NodeLocal {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate with the filter later

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very slight optimization allowing us to not loop over endpoints[] because every one of them will be a miss.

@istio-testing istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 23, 2023
metadataCerts: cb.metadataCerts,
peerAuthVersion: cb.req.Push.AuthnPolicies.GetVersion(),
serviceAccounts: cb.req.Push.ServiceAccounts(service.Hostname, service.Attributes.Namespace, port.Port),
endpointBuilder: endpoints.NewCDSEndpointBuilder(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of the slowdown might be that we do this unconditionally. None of our benchmarks are DNS resolution. This patch should have almost no impact.

Copy link
Copy Markdown
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General idea is fine but need to fix the performance


// We have a cache miss, so we will re-generate the cluster and later store it in the cache.
lbEndpoints := cb.buildLocalityLbEndpoints(clusterKey.proxyView, service, port.Port, nil)
lbEndpoints := clusterKey.endpointBuilder.FromServiceEndpoints()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lbEndpoints := clusterKey.endpointBuilder.FromServiceEndpoints()
var lbEndpoints []*endpoint.LocalityLbEndpoints
// We have a cache miss, so we will re-generate the cluster and later store it in the cache.
if (service.Resolution == model.DNSLB || service.Resolution == model.DNSRoundRobinLB) {
lbEndpoints = clusterKey.endpointBuilder.FromServiceEndpoints()
}

I think this fixes the performance regression, which is due to always computing endpoints even if its an EDS cluster.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.. I did this on the original draft and un-did it. Thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still +5% mean. The ones that have a higher regression are on the order of us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out it was passing around/copying the EndpointBuilder struct. Made that a pointer and everything looks good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is equal now.

stevenctl and others added 3 commits August 23, 2023 13:58
@stevenctl
Copy link
Copy Markdown
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

multi-network filters are not applied to non-EDS clusters

6 participants