Cache API key doc to reduce traffic to the security index by ywangd · Pull Request #59376 · elastic/elasticsearch

ywangd · 2020-07-13T00:30:16Z

Getting the API key document form the security index is the most time consuing part of the authentication flow (>60% if index is local and >90% if index is remote). This traffic can be avoided with cache. Initially we also plan to avoid deserialising source of GetApiKeyDocument response if "primary term" and "sequence number" are unchanged. However, given the document fetching itself is dominant in execution time, it does not seem to be worthwhile to optimise for deserialisation. Hence, for uncached document, the cache will fetch the document and also simply deserialise it without further checking.

When a large number of API keys are in use, we expect most of them are created with exactly the same role-descriptors and limited-by-role-descriptors. It is wasteful to cache these identical role descriptors repeatitively for different API keys. Hence a two layer caches are put in place:

The first cache is keyed by docId and values are content of the API key document, except that the role-descriptors and limited-by-role-descriptors are replacecd by their sha256 hashes respectively.
The second cache is keyed by the above sha256 hashes and values are the corresponding role-descriptors.

The lifecyces of the above two caches are mostly tied together and they need work together in a way to ensure least possibility of stale results. Hence a new SecurityCaches class is added and it implements all necessary handlings as discussed in #55836.

@tvernum This is a quick attempt to implement a cache for API key document. It intentionally omits some parts, e.g. tests, configuration, to get a quick turnaround for proving the idea. It is easier to discuss when we have something concrete. We could try formalise it for v7.9 if feasible. But I am also OK to delay it if more careful evaluation is needed. Thanks!

Resolves: #53940

ywangd · 2020-07-13T00:35:04Z

Please note this PR is built upon the yet-to-be-approved #58156. Therefeore many changes are actually not relevant to this PR itself. Here are a list of actual relevant changes:

ywangd · 2020-07-13T04:25:44Z

@elasticmachine run elasticsearch-ci/1

tvernum · 2020-07-13T08:25:33Z

I don't think it's worth pushing this in before FF.
It's a good change, but I think we need time to discuss things such the split doc/auth caches.

ywangd · 2020-07-13T08:59:49Z

I think we need time to discuss things such the split doc/auth caches.

Thanks, I agree. The changes are still rough. I also thought about merging doc and auth caches but didn't do it mainly because of complexity and potential implications of complex changes. It requires more time to get right.
Also a big part of the change is about refactoring the caching behaviour into a separate class, which probably deserves its own PR.

…doc-avoidance

ywangd · 2020-08-11T08:58:59Z

I spent some time look into how we could have a single cache for both API key doc and Auth hash. I came up with a version that works. However, it comes with high complexity because the two parts of data have different lifecycles.

With the single cache approach, a cache item contains both content of an API key document and the Authentication hash. These two parts can vary independantly. That is, when the document needs to be refreshed for caching, the authentication hash can still be valid because it is based on user-provided credentials and only associated with the document ID, not its content. Similarly, when the authentication hash needs to be updated, the document can still be valid because unlike authentication hash it does not change based on user-provided credentials. So what we need is a cache where the cache item can be partially invalidated. This requirement is basically asking for two separate caches. It is possible to make it work with a single cache as mentioned earlier, but it leads to high complexity due to the diverging requirement plus how it could work with a distributed system. Therefore, even though I already have an implementation for it, I still prefer the approrach of separate caches. Honestly, I had fun caming up with my current implementaiton and trying to catch all edge cases. But I am concerned that it is difficult to explain how it works and maintain it a few years from now.

After discussion with @tvernum, we decided to go with separate caches. There is no overlap between contents of the two caches. Hence we do not have concern in terms of duplication (memory consumption).

This reverts commit df2a1d8.

…doc-avoidance

...n/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityCacheRegistry.java

ywangd · 2020-08-13T13:08:56Z

@tvernum After a few back and forth, here is the version that I am settled with. It touches the idea of consolidation for cache implementaitons, though with a much less ambitious approach. In addition to adding a cache for API key document, it also provides two enhancements:

A SecurityCache class to encapsulate best practicies around caching expiry and stale results. It is fairly simple in that it does not try to be a high level generic solution to all cache requirements, which is difficult and the complexity is not worth it (I had a version with 5 type parameters). Thus it requires a bit of manual work when using it. We can improve it down the path, but for now I think it is reasonable start point.
A SecurityCacheRegistry that centralise the caching clearing logic for Rest/Transport handlers. Right now it is just used for API key caches, including the new API key doc cache and the existing auth cache. But I can see it to replace the handlers for role cache and privilege cache and possibly realm cache as well. These can be done in a separate PR if the approach here looks good to you. This possibility is also in line with your previous comment on the privilege cache PR.

The PR still needs a few more things:

Tests
Client side ecode
Documentation

But I'd appreciate your review on the approach before I polish it.

…doc-avoidance

ywangd · 2020-09-07T03:44:41Z

@tvernum I updated the PR based on your suggestion with some name tweaks. The changes are basically to simplify the SecurityCache, which in fact is now replaced by ConsistentCache. The ConsistentCache#checkpoint method returns a Checkpoint interface that can be used to cache given key/value pair while minimizing the possibility of caching stale results.

The SecurityCacheRegistry is renamed to CacheInvalidatorRegistry since that is all it does, i.e. providing cache invalidation services to different caches.

Could you please reivew the current approach? I'll proceeds with adding tests etc once we can agree on the overall approach. Thanks!

…doc-avoidance

tvernum · 2020-09-08T04:20:04Z

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/ApiKeyService.java

+                .setMaximumWeight(maximumWeight)
                .build();
+            this.roleDescriptorsBytesCache = CacheBuilder.<String, BytesReference>builder()
+                .setExpireAfterAccess(ttl)


I wonder about this.
It defaults to 24h, and we don 't have much in place to clean it up.
Should it really cache for that long? It's expiry after access, so 1h feels like it would be sufficient.

From my brief reading, once the doc cache expires (5m after last write) there will be no incoming references to these entries. They may get revived, but I don't think holding onto that memory for 24h is justified.

I changed it to 1h. It was set to 24h because

24h is an existing setting, so felt easy to reuse. I understand technically it has different meaning. But the subtlety here feels irrelevant to end-users.

Unlike API key document cache, the role descriptor cache, in theory, never "expires" in that they will never be outdated. Sure the entries may no longer be needed, but the association between a hash and its original content can never be invalid. Also many API keys may share the same role descriptors, for example, the superuser (limiting) role. These made me feel it is plausible to cache the descriptors for longer. Because even when a Key is expired or invalidated or haven't checked in for a while, it is likey a new key will be replacing it and the new key will have the exact same role descriptors and thus reuse the cache entries.

With above being said, I don't think the benefit of caching for longer is anyway significant, since the main performance hit is the document retrieval itself. So I am OK to have it at a shorter duration.

...ecurity/src/main/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistry.java

Co-authored-by: Tim Vernum <tim@adjective.org>

ywangd · 2020-10-04T13:10:45Z

@albertzaharovits I updated this PR per your review and discussion. It is ready for another look. Thanks!

…doc-avoidance

tvernum

LGTM, but I have a few suggestions (some of which are strong suggestions).

client/rest-high-level/src/main/java/org/elasticsearch/client/SecurityRequestConverters.java

...rest-high-level/src/main/java/org/elasticsearch/client/security/ClearApiKeyCacheRequest.java

x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/ApiKeyService.java

...src/main/java/org/elasticsearch/xpack/security/action/TransportClearSecurityCacheAction.java

tvernum · 2020-10-06T03:30:27Z

...re/src/main/java/org/elasticsearch/xpack/core/security/action/ClearSecurityCacheRequest.java

+        super.writeTo(out);
+        out.writeString(cacheName);
+        out.writeOptionalStringArray(keys);
+    }


Ideally we would have serialization tests for every class that has a writeTo method.

Add a serialisation test.

...ty/src/test/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistryTests.java

tvernum · 2020-10-06T03:36:40Z

...ecurity/src/main/java/org/elasticsearch/xpack/security/support/CacheInvalidatorRegistry.java

+        final CacheInvalidator cacheInvalidator = cacheInvalidators.get(cacheName);
+        if (cacheInvalidator != null) {
+            cacheInvalidator.invalidate(keys);
+        }


The lack of an else seems like dangerous leniency. Why do we silently pretend to invalidate a cache that doesn't exist?

This is a good point and there is no reason for this leniency. Add an else branch to throw exception and also relevant tests.

...est/java/org/elasticsearch/xpack/security/support/InvalidationCountingCacheWrapperTests.java

Co-authored-by: Tim Vernum <tim@adjective.org>

albertzaharovits

LGTM Good efforts on this one!

) Getting the API key document form the security index is the most time consuing part of the API Key authentication flow (>60% if index is local and >90% if index is remote). This traffic is now avoided by caching added with this PR. Additionally, we add a cache invalidator registry so that clearing of different caches will be managed in a single place (requires follow-up PRs).

…63319) Getting the API key document form the security index is the most time consuing part of the API Key authentication flow (>60% if index is local and >90% if index is remote). This traffic is now avoided by caching added with this PR. Additionally, we add a cache invalidator registry so that clearing of different caches will be managed in a single place (requires follow-up PRs).

This PR is a follow-up refactoring for #59376. It replaces InvalidationCountingCacheWrapper with a simpler and more resuable LockingAtomicCounter. The change includes extracting the logic of "minimizing chance of caching stale results" into a new class, which is now generic and not tied to any cache related operations. This in turn allows it to be reused by different caches, which often have subtle, important and anonying differences that make it impossible (or at least cubersome and painful) to have a generic solution. So this change of reducing the size of reusable code is the right move. As an example of applying this new pattern, the caches for NativePrivilegeStore are now migrated and managed by the centrialied CacheInvalidatorRegistry. The clear privilege cache rest and transport actions are also updated to use the new implementation. The Request/Response classes are intentionally untouched and plan to have a separate PR to address them because they would involve user visible deprecation changes. Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>

This PR is a follow-up refactoring for elastic#59376. It replaces InvalidationCountingCacheWrapper with a simpler and more resuable LockingAtomicCounter. The change includes extracting the logic of "minimizing chance of caching stale results" into a new class, which is now generic and not tied to any cache related operations. This in turn allows it to be reused by different caches, which often have subtle, important and anonying differences that make it impossible (or at least cubersome and painful) to have a generic solution. So this change of reducing the size of reusable code is the right move. As an example of applying this new pattern, the caches for NativePrivilegeStore are now migrated and managed by the centrialied CacheInvalidatorRegistry. The clear privilege cache rest and transport actions are also updated to use the new implementation. The Request/Response classes are intentionally untouched and plan to have a separate PR to address them because they would involve user visible deprecation changes. Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>

This PR is a follow-up refactoring for #59376. It replaces InvalidationCountingCacheWrapper with a simpler and more resuable LockingAtomicCounter. The change includes extracting the logic of "minimizing chance of caching stale results" into a new class, which is now generic and not tied to any cache related operations. This in turn allows it to be reused by different caches, which often have subtle, important and anonying differences that make it impossible (or at least cubersome and painful) to have a generic solution. So this change of reducing the size of reusable code is the right move. As an example of applying this new pattern, the caches for NativePrivilegeStore are now migrated and managed by the centrialied CacheInvalidatorRegistry. The clear privilege cache rest and transport actions are also updated to use the new implementation. The Request/Response classes are intentionally untouched and plan to have a separate PR to address them because they would involve user visible deprecation changes. Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>

ywangd added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v8.0.0 v7.9.0 labels Jul 13, 2020

ywangd requested a review from tvernum July 13, 2020 00:30

ywangd force-pushed the es-53940-apikey-getdoc-avoidance branch from 8fa3869 to 32a7a31 Compare July 15, 2020 01:27

WIP: add cache for API Key document.

7501b05

ywangd force-pushed the es-53940-apikey-getdoc-avoidance branch from 32a7a31 to 7501b05 Compare July 20, 2020 06:09

ywangd added 3 commits August 3, 2020 15:48

Merge remote-tracking branch 'origin/master' into es-53940-apikey-get…

97e1318

…doc-avoidance

Merge remote-tracking branch 'origin/master' into es-53940-apikey-get…

c166c13

…doc-avoidance

WIP: Have one cache for API key doc and auth result.

df2a1d8

ywangd added 5 commits August 12, 2020 14:17

Revert "WIP: Have one cache for API key doc and auth result."

e20dad8

This reverts commit df2a1d8.

Add security cache common behaviours and handlers

f425470

checkstyle

5536b88

Fix for no cache scenario

05dc6ae

Merge remote-tracking branch 'origin/master' into es-53940-apikey-get…

c5c690b

…doc-avoidance

ywangd commented Aug 13, 2020

View reviewed changes

...n/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityCacheRegistry.java Outdated Show resolved Hide resolved

ywangd added 2 commits September 4, 2020 13:04

Merge remote-tracking branch 'origin/master' into es-53940-apikey-get…

9e62e81

…doc-avoidance

Simplify security cache based on feedback.

a0b3bd0

Merge remote-tracking branch 'origin/master' into es-53940-apikey-get…

33a7553

…doc-avoidance

tvernum reviewed Sep 8, 2020

View reviewed changes

ywangd and others added 2 commits September 9, 2020 00:25

Apply suggestions from code review

aafed3f

Co-authored-by: Tim Vernum <tim@adjective.org>

feedback

283d1f5

ywangd added 3 commits October 4, 2020 23:58

Add cache clear call after key invalidation

748ddbb

checkstyle

52d793a

More check style

f3b683e

ywangd requested a review from albertzaharovits October 4, 2020 13:09

ywangd added 5 commits October 5, 2020 09:39

Fix tests

720ee2a

Refactor to better manage the secondary role descriptor cache

52e4d6b

null check optimize

901e85f

Fix tests

70ab17b

checkstyle

4cbffa3

ywangd mentioned this pull request Oct 5, 2020

Bulk invalidate API keys using a list of IDs #63224

Merged

ywangd added 4 commits October 5, 2020 13:20

Fix tests

ac8ca35

Address feedback - add dedicated rest action for clear api key cache

11e591b

Merge remote-tracking branch 'origin/master' into es-53940-apikey-get…

fe09225

…doc-avoidance

Remove clear security request base class from client code

fb29d52

tvernum approved these changes Oct 6, 2020

View reviewed changes

ywangd and others added 3 commits October 6, 2020 14:47

Apply suggestions from code review

e1a3b40

Co-authored-by: Tim Vernum <tim@adjective.org>

Apply suggestions from code review

cefed1a

Co-authored-by: Tim Vernum <tim@adjective.org>

Address feedback

ef95a1d

albertzaharovits approved these changes Oct 6, 2020

View reviewed changes

ywangd merged commit 9e1f912 into elastic:master Oct 6, 2020

ywangd mentioned this pull request Dec 15, 2020

Improve reusability of the invalidation counting cache wrapper #66310

Merged

ywangd mentioned this pull request Apr 19, 2021

Improve reusability of the cache invalidation counter (#66310) #71802

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Conversation

ywangd commented Jul 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ywangd commented Jul 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ywangd commented Jul 13, 2020

Uh oh!

tvernum commented Jul 13, 2020

Uh oh!

ywangd commented Jul 13, 2020

Uh oh!

ywangd commented Aug 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ywangd commented Aug 13, 2020

Uh oh!

ywangd commented Sep 7, 2020

Uh oh!

tvernum Sep 8, 2020

Choose a reason for hiding this comment

Uh oh!

ywangd Sep 8, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ywangd commented Oct 4, 2020

Uh oh!

tvernum left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tvernum Oct 6, 2020

Choose a reason for hiding this comment

Uh oh!

ywangd Oct 6, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tvernum Oct 6, 2020

Choose a reason for hiding this comment

Uh oh!

ywangd Oct 6, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

albertzaharovits left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ywangd commented Jul 13, 2020 •

edited

Loading

ywangd commented Jul 13, 2020 •

edited

Loading

ywangd commented Aug 11, 2020 •

edited

Loading