[Kubernetes secret provider] Add cache for the secrets #3822
constanca-m merged 29 commits into elastic:main
Conversation
This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Reminder that we also need a documentation update for this change in https://www.elastic.co/guide/en/fleet/current/kubernetes_secrets-provider.html
blakerouse left a comment
The biggest issue I see in this PR is how a key gets removed from the cache if it is no longer referenced from the policy. Let's say the policy references kubernetes_secrets.somenamespace.somesecret.value1 and then changes to kubernetes_secrets.somenamespace.somesecret.value2. This change will keep caching kubernetes_secrets.somenamespace.somesecret.value1 forever.
To solve the problem, you should add a last-accessed time per cached key, and during each update cycle, based on an interval, remove keys that have not been accessed. That will solve the issue I described above.
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
@blakerouse thanks for this. Makes sense.
Should we also give that option to the user, or should we enforce that time ourselves @gizas? Edit: I enforced it as 1h. Currently, the user cannot change it.
- Update duration config format
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
This is also missing unit test coverage for any of these additions; that needs to be added as well.
- Add ttl for both update and delete
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
Yes, I am still waiting on my comment about holding the lock the entire time the code is refreshing the cache. I still think that is a big issue that needs to be looked at.
I don't have another explanation for that other than this comment. Is that a blocker? @blakerouse
cmacknz left a comment
Took another look; there is still at least one bug and a few cases I could spot where splitting the cache changes across separate read-lock and write-lock sections introduces concurrency problems.
This is because the manipulation you are doing to the cache is no longer atomic across a single function body, allowing the cache to change in the middle of the function execution when you release the read lock and acquire the write lock.
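The race being described can be sketched like this. The type and function names are hypothetical, not the PR's actual code; the point is the gap between releasing the read lock and acquiring the write lock, and the fix of re-checking under the write lock so the check and insert are atomic with respect to other writers.

```go
package main

import (
	"fmt"
	"sync"
)

type cache struct {
	mu sync.RWMutex
	m  map[string]string
}

// getOrFetchRacy shows the problem pattern: the read lock is released
// before the write lock is taken, so another goroutine can insert the
// key in the gap and this goroutine may clobber it with a second fetch.
func (c *cache) getOrFetchRacy(key string, fetch func() string) string {
	c.mu.RLock()
	v, ok := c.m[key]
	c.mu.RUnlock()
	if ok {
		return v
	}
	// Gap: the cache can change here.
	c.mu.Lock()
	defer c.mu.Unlock()
	v = fetch()
	c.m[key] = v
	return v
}

// getOrFetchSafe re-checks the key after acquiring the write lock,
// so the check-then-insert is atomic with respect to other writers.
func (c *cache) getOrFetchSafe(key string, fetch func() string) string {
	c.mu.RLock()
	v, ok := c.m[key]
	c.mu.RUnlock()
	if ok {
		return v
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.m[key]; ok { // repeat the check under the write lock
		return v
	}
	v = fetch()
	c.m[key] = v
	return v
}

func main() {
	c := &cache{m: map[string]string{}}
	fmt.Println(c.getOrFetchSafe("secret", func() string { return "value" }))
}
```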
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
cmacknz left a comment
One last thought: it seems like it would be straightforward to add a configuration entry that disables the cache.
That way, if there is some unexpected surprise, we can revert to the previous behavior without needing a release.
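A kill-switch like that could look roughly like the sketch below. The config key names (`disable_cache`, `cache_ttl`) and the function signatures are assumptions for illustration, not the names the PR actually merged.

```go
package main

import "fmt"

// Config is a hypothetical sketch of the provider configuration:
// a boolean that disables the cache entirely, restoring the old
// behavior of querying the API Server on every Fetch.
type Config struct {
	DisableCache bool   `config:"disable_cache"`
	CacheTTL     string `config:"cache_ttl"` // e.g. "1h"; name assumed
}

// fetchSecret consults the cache only when it is enabled; otherwise it
// falls straight through to the API Server call.
func fetchSecret(cfg Config, key string,
	fromCache func(string) (string, bool),
	fromAPI func(string) string) string {
	if !cfg.DisableCache {
		if v, ok := fromCache(key); ok {
			return v
		}
	}
	return fromAPI(key)
}

func main() {
	cfg := Config{DisableCache: true}
	v := fetchSecret(cfg, "k",
		func(string) (string, bool) { return "cached", true },
		func(string) string { return "from-api" })
	fmt.Println(v) // prints "from-api"
}
```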
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
- Remove goroutine
- Refactor timeout name to requestTimeout
I added this option in my last commit. I also added the proper unit test for it.
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
```go
// if value is still not present in cache, it is possible we haven't tried to fetch it yet
if !ok {
	value, ok := p.addToCache(key)
```
Why grab the lock inside of the function to set the new value, then only 4 lines below grab it again to set the lastAccess time?
Why not just set lastAccess inside of addToCache and then return inside of the if statement? That would avoid grabbing the lock twice for the same key.
> Why not just set lastAccess inside of addToCache and then return inside of the if statement?

Because then we would be changing lastAccess in two functions, when we don't need to. The logic is that lastAccess is only changed when it is required (by calling the get from cache).
To me it is worse to grab a lock, release the lock, grab the lock again, then release it again, all in the same path. It should grab the lock, do what it needs to do, and release it.
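For context, the single-lock shape the reviewer prefers can be sketched as below. The `provider`, `entry`, and `addToCache` identifiers are hypothetical stand-ins, not the PR's actual code: `addToCache` sets `lastAccess` itself, so the miss path acquires the lock once instead of twice.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value      string
	lastAccess time.Time
}

type provider struct {
	mu    sync.Mutex
	cache map[string]*entry
	fetch func(key string) string // stand-in for the API Server call
}

// addToCache fetches the value and stores it together with lastAccess
// under a single lock acquisition.
func (p *provider) addToCache(key string) string {
	v := p.fetch(key)
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cache[key] = &entry{value: v, lastAccess: time.Now()}
	return v
}

// Fetch updates lastAccess on a hit; on a miss it delegates to
// addToCache, which handles both the value and the timestamp at once.
func (p *provider) Fetch(key string) string {
	p.mu.Lock()
	if e, ok := p.cache[key]; ok {
		e.lastAccess = time.Now()
		p.mu.Unlock()
		return e.value
	}
	p.mu.Unlock()
	return p.addToCache(key)
}

func main() {
	p := &provider{
		cache: map[string]*entry{},
		fetch: func(k string) string { return "value-of-" + k },
	}
	fmt.Println(p.Fetch("somesecret")) // prints "value-of-somesecret"
}
```

The trade-off the thread is weighing: this keeps `lastAccess` handling in two functions (less DRY), in exchange for one fewer lock round-trip on the miss path.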
blakerouse left a comment
Thanks for all the changes and fixes. I am glad a solution was found for not holding the lock the entire time a refresh loop occurs.
My last point about hold/release followed by another hold/release is not a blocker; it just seems like it would use fewer CPU cycles to grab the lock once. I understand you are trying to be DRY with the code, but for something as simple as setting a single timestamp attribute, I think less lock contention would be worth it.

88.9% Coverage on New Code
0.0% Duplication on New Code
What does this PR do?
Currently, we make a request to the API Server every time we need to get the value of a secret. This issue is better explained here; the discussion is also present in the issue.
With this PR, we use a map to store the secrets like this:
The secrets are accessed from the outside through the function Fetch. If we try to retrieve a secret that is not present in our map, we make a request to the API Server to obtain its value and store it in the map. This function never causes an update of the secret values.
The only difference between this approach and the current one is that we now store the values of the secrets.
Warning: Sometimes the secret values are not updated in time, and we can be using incorrect ones until the update happens. This also happens now.
Why is it important?
To stop overwhelming the API Server with secret requests.
Checklist
- Added an entry in ./changelog/fragments using the changelog tool
How to test this PR locally
Related issues
Screenshots
There should not be a change in behavior apart from a decrease in calls to the API Server. Everything else works as expected/before: