Istiod uses significantly more memory
Bug Description
We upgraded to the most recent Istio version and immediately saw a jump in Istiod memory usage. We have increased the memory limit to 14G and it is still not sufficient; ultimately Istiod gets OOMKilled.
Version
Master
Additional Information
No response
Initial analysis shows that this is because of the RDS cache enabled by PILOT_ENABLE_RDS_CACHE.
Here are some pprof outputs:
@hzxuzhonghu I suspect it could be because of https://github.com/istio/istio/blob/master/pilot/pkg/networking/core/v1alpha3/route/route_cache.go#L48 - we are storing the entire VirtualService in the cache key instead of a ConfigKey?
Config.Spec has a reference to the underlying VirtualService.
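To illustrate what I mean (hypothetical shapes, not the actual route cache types - just the contrast between a key that embeds the config and one that only records identifiers):

```go
package main

import "fmt"

// Hypothetical stand-ins, not the real Istio types.

// A ConfigKey-style reference: small, holds no pointer to the spec.
type ConfigKey struct {
	Kind, Name, Namespace string
}

// A full config: Spec can be a large proto (the underlying VirtualService).
type Config struct {
	ConfigKey
	Spec any // anything that retains this keeps the whole VirtualService alive
}

// If the cache key struct embeds the configs, every cached key pins the specs.
type heavyCacheKey struct{ VirtualServices []Config }

// If it embeds only the identifiers, the cached key stays small.
type lightCacheKey struct{ VirtualServices []ConfigKey }

func main() {
	vs := Config{ConfigKey: ConfigKey{"VirtualService", "reviews", "default"}, Spec: struct{}{}}
	_ = heavyCacheKey{VirtualServices: []Config{vs}} // retains vs.Spec for the key's lifetime
	k := lightCacheKey{VirtualServices: []ConfigKey{vs.ConfigKey}}
	fmt.Println(k.VirtualServices[0].Name) // only identifiers are retained
}
```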
And actually the LRU route cache key is only an md5 checksum; I am not sure how you profiled this. Also, from your profile I could not see how much memory Istiod uses in total.
> Config.Spec has a reference to the underlying VirtualService.
Yes.
> And actually the LRU route cache key is only an md5 checksum; I am not sure how you profiled this. Also, from your profile I could not see how much memory Istiod uses in total.
@deveshkandpal1224 can you please share complete profile before/after RDS cache and the memory usage?
Can you upload the actual profile tar.gz as well when you get the new one?
@howardjohn @hzxuzhonghu - sorry for the delay, attached profiles as requested. PTAL - let me know if there is anything else that you need. Thanks!
low-memory ->
pprof.pilot-low-mem-discovery.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
high memory ->
pprof.pilot-discovery-high-mem.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz
pprof.pilot-discovery-high-mem.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz
Wow, so if I read this right, 4GB of keys. A key is 32 bytes; that is 125,000,000 keys. But the XDS cache max size is 60,000. So how can we use so much memory storing keys...?
It also seems like we actively have 626MB in the params = append(params, vs.Name+"/"+vs.Namespace) code of Key(). That doesn't seem right; it should be freed immediately - maybe it's just GC delay, though. I would note we could trivially optimize that to call hash.Write immediately rather than build an intermediate slice.
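Roughly the shape of the optimization I mean, as a sketch (hypothetical names, not the actual Key() implementation): stream the fragments straight into the hasher instead of accumulating an intermediate params slice.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

type vsRef struct{ Name, Namespace string }

// Before (the shape of the concern): build a slice of strings, then hash it.
func keyWithIntermediate(vss []vsRef) string {
	params := make([]string, 0, len(vss))
	for _, vs := range vss {
		params = append(params, vs.Name+"/"+vs.Namespace) // extra allocations held until hashing
	}
	h := md5.New()
	for _, p := range params {
		h.Write([]byte(p))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// After: write the same bytes straight into the hasher, no intermediate slice.
func keyDirect(vss []vsRef) string {
	h := md5.New()
	for _, vs := range vss {
		h.Write([]byte(vs.Name))
		h.Write([]byte{'/'})
		h.Write([]byte(vs.Namespace))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	vss := []vsRef{{"reviews", "default"}}
	fmt.Println(keyWithIntermediate(vss) == keyDirect(vss)) // true: same digest, fewer allocations
}
```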
> Wow, so if I read this right, 4GB of keys. A key is 32 bytes; that is 125,000,000 keys. But the XDS cache max size is 60,000. So how can we use so much memory storing keys...?
Is something wrong with our key calculation, or is each key unnecessarily holding on to something?
@howardjohn @ramaraochavali I just went ahead and did what Howard was suggesting to optimize the hash key generation: https://github.com/istio/istio/pull/39412
There are still some more optimizations that could be done: every hash Write copies the string to convert it to a []byte. This copy could be avoided with some unsafe code, as is done in many libraries. Should we do this as well? Is this considered a performance-critical path?
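For reference, this is the kind of zero-copy conversion I mean, sketched with the Go 1.20+ unsafe API (the returned slice must never be written to, which is fine for a hash.Write that only reads):

```go
package main

import (
	"crypto/md5"
	"fmt"
	"unsafe"
)

// stringToBytes returns a []byte view over the string's backing array
// without copying. Strings are immutable, so the caller must never
// modify the returned slice; read-only uses such as hash.Write are safe.
func stringToBytes(s string) []byte {
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	h := md5.New()
	h.Write(stringToBytes("reviews/default")) // no per-Write copy of the string
	fmt.Printf("%x\n", h.Sum(nil))
}
```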
@howardjohn I don't really know a lot of the Istio code, but it does seem that we're leaking things in XdsCache, specifically in configIndex.
When an entry gets added, we "update" the configIndex here. This ends up adding entries to configIndex that never seem to be cleared. There is a Clear method on the XdsCache, but it is only called from one place, so I'm guessing we're not clearing things when we should.
What happens if the internal LRU store evicts an entry? It would seem we definitely leak entries in this case, since our evict function only records metrics.
Is it correct for us to be wrapping LRU this way?
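To make the shape concrete, here is a stripped-down sketch (hypothetical types, not the actual XdsCache code): the store itself is bounded, the evict hook only bumps a counter, and the secondary configIndex only ever grows.

```go
package main

import "fmt"

// Minimal stand-ins; the real cache uses an LRU store and model.ConfigKey.
type ConfigKey struct{ Kind, Name, Namespace string }

type cache struct {
	store       map[string]string             // pretend this is the bounded LRU
	configIndex map[ConfigKey]map[string]bool // config -> cache keys that depend on it
	evictions   int
}

func (c *cache) Add(key, value string, dependents []ConfigKey) {
	if len(c.store) >= 2 { // pretend the LRU hit its max size: evict one entry
		for k := range c.store {
			delete(c.store, k)
			c.evictions++ // the evict hook only records a metric...
			// ...nothing removes k from configIndex, so the index keeps growing
			break
		}
	}
	c.store[key] = value
	for _, dep := range dependents {
		if c.configIndex[dep] == nil {
			c.configIndex[dep] = map[string]bool{}
		}
		c.configIndex[dep][key] = true
	}
}

func main() {
	c := &cache{store: map[string]string{}, configIndex: map[ConfigKey]map[string]bool{}}
	for i := 0; i < 5; i++ {
		vs := ConfigKey{"VirtualService", fmt.Sprintf("vs-%d", i), "default"}
		c.Add(fmt.Sprintf("route-%d", i), "route-config", []ConfigKey{vs})
	}
	fmt.Println(len(c.store), len(c.configIndex)) // store stays bounded at 2, configIndex grows to 5
}
```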
I haven't had a chance to dig into it, but it seems plausible that this would cause it.
How many resources do you create? I guess this is a scalability test.
Not really. It is one of our biggest clusters, but other clusters have also seen an increase in memory, just not at this level.
@howardjohn Another thing to note about the current cache implementation: the internal LRU store is bounded by a max size, but every time we add an entry we add to configIndex and typeIndex in an unbounded fashion. We don't remove elements from these maps when the LRU evicts an entry, so these maps can grow without bound.
I still don't know if we actually leak things or not, but we definitely keep elements there longer than we should, and we don't have a mechanism to limit the growth of those indexes.
I also don't see an easy way to fix this, since the evict function receives the key and value that we store in the LRU store, but we cannot derive the keys of configIndex and typeIndex from that key or value.
@howardjohn @ramaraochavali @hzxuzhonghu What about something like this? https://github.com/istio/istio/pull/39435
I don't know if this would be the right approach, though. We would have some increased memory overhead from storing the dependencies (it might not be too much), and writes that cause evictions would be slower because we now have to clear the evicted key's index entries.
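Roughly the idea, as a simplified sketch (not the PR's actual code): record each entry's dependent config keys next to the value, so that evicting or overwriting the entry is enough to clean the reverse index.

```go
package main

import "fmt"

type ConfigKey struct{ Kind, Name, Namespace string }

// Each cached value carries the configs it was derived from, so that
// removing the value is enough to clean up the reverse index.
type entry struct {
	value      string
	dependents []ConfigKey
}

type cache struct {
	store       map[string]entry
	configIndex map[ConfigKey]map[string]bool
}

func (c *cache) onEvict(key string, e entry) {
	for _, dep := range e.dependents {
		delete(c.configIndex[dep], key)
		if len(c.configIndex[dep]) == 0 {
			delete(c.configIndex, dep)
		}
	}
}

func (c *cache) Add(key, value string, dependents []ConfigKey) {
	if old, ok := c.store[key]; ok {
		c.onEvict(key, old) // overwriting counts as evicting the old entry
	}
	c.store[key] = entry{value: value, dependents: dependents}
	for _, dep := range dependents {
		if c.configIndex[dep] == nil {
			c.configIndex[dep] = map[string]bool{}
		}
		c.configIndex[dep][key] = true
	}
}

// Evict simulates what the LRU's eviction callback would now do.
func (c *cache) Evict(key string) {
	if e, ok := c.store[key]; ok {
		c.onEvict(key, e)
		delete(c.store, key)
	}
}

func main() {
	c := &cache{store: map[string]entry{}, configIndex: map[ConfigKey]map[string]bool{}}
	vs := ConfigKey{"VirtualService", "reviews", "default"}
	c.Add("route-a", "route-config", []ConfigKey{vs})
	c.Evict("route-a")
	fmt.Println(len(c.store), len(c.configIndex)) // 0 0: the index no longer leaks
}
```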
@ramaraochavali do you have the cache metrics enabled? Is it possible for you to check whether there were evictions on the cluster that showed the huge memory increase?
If there are no evictions, then my theory would be incorrect.
@sschepens No, your theory is correct either way, regardless of this particular issue. We should think about how to fix it, but without introducing more memory cost.
@sschepens Yes, we have, and it seems there are evictions - see the last month of data:
(screenshot: cache eviction metrics for the last month)
@ramaraochavali thanks for the info. There seem to be a lot of evictions, so yeah, I think that could be the real problem.
Is there any chance you could deploy my PR to one of those clusters and get another profile?
> Is there any chance you could deploy my PR to one of those clusters and get another profile?
It is a little tough, but I will try and let you know.
@ramaraochavali: Is the memory leak mentioned in this issue applicable to 1.13 as well? I see from the issue description that you upgraded to the master branch.
@nirvanagit It should be applicable, I guess, since the XdsCache is also there in 1.13, but the impact may vary.
@sschepens So we have deployed the fix and re-enabled the RDS cache. We still see Istiod getting OOMKilled. Here are the latest pprof outputs:
pprof.pilot-discovery.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
pprof.pilot-discovery.alloc_objects.alloc_space.inuse_objects.inuse_space.006.pb.gz
cacheValue now records dependentConfigs, which makes up 52.66% of the memory?
I think the problem is https://github.com/golang/go/issues/20135 - we need to periodically clean up the config index map. I am looking at it.
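For context, golang/go#20135 is about Go maps not releasing bucket memory after deletes, so an index map that once grew large stays large even when it is mostly empty. A rough sketch of the kind of periodic cleanup meant here (hypothetical names): rebuild the map so the old, oversized one can be collected.

```go
package main

import "fmt"

type ConfigKey struct{ Kind, Name, Namespace string }

// compactIndex copies the surviving entries into a freshly allocated map.
// Deleting keys from a Go map does not shrink its bucket array
// (golang/go#20135), so rebuilding is how the memory is given back.
func compactIndex(old map[ConfigKey]map[string]bool) map[ConfigKey]map[string]bool {
	fresh := make(map[ConfigKey]map[string]bool, len(old))
	for k, v := range old {
		fresh[k] = v
	}
	return fresh
}

func main() {
	idx := make(map[ConfigKey]map[string]bool)
	for i := 0; i < 100000; i++ {
		idx[ConfigKey{"VirtualService", fmt.Sprint(i), "default"}] = map[string]bool{"key": true}
	}
	for k := range idx { // simulate clearing the entries; the buckets are retained
		delete(idx, k)
	}
	idx = compactIndex(idx) // periodic rebuild lets the old map be GC'd
	fmt.Println(len(idx))
}
```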
But from the profile, it is not the map; it is the dependentConfigs slice.
dependentConfigs is a set (a map), not a slice.
?? DependentConfigs() []ConfigKey
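Both statements can be true: the field can be stored as a set while the accessor flattens it into a slice. A rough sketch of that shape (hypothetical, not the actual Istio code):

```go
package main

import "fmt"

type ConfigKey struct{ Kind, Name, Namespace string }

type cacheValue struct {
	// Stored as a set so duplicate dependencies collapse to one entry.
	dependentConfigs map[ConfigKey]struct{}
}

// DependentConfigs exposes the set as a slice, which is why the
// accessor's signature reads []ConfigKey even though the field is a map.
func (v cacheValue) DependentConfigs() []ConfigKey {
	out := make([]ConfigKey, 0, len(v.dependentConfigs))
	for k := range v.dependentConfigs {
		out = append(out, k)
	}
	return out
}

func main() {
	v := cacheValue{dependentConfigs: map[ConfigKey]struct{}{
		{"VirtualService", "reviews", "default"}: {},
	}}
	fmt.Println(len(v.DependentConfigs())) // 1
}
```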