Skip to content

Cache API key doc to reduce traffic to the security index#59376

Merged
ywangd merged 45 commits intoelastic:masterfrom
ywangd:es-53940-apikey-getdoc-avoidance
Oct 6, 2020
Merged

Cache API key doc to reduce traffic to the security index#59376
ywangd merged 45 commits intoelastic:masterfrom
ywangd:es-53940-apikey-getdoc-avoidance

Conversation

@ywangd
Copy link
Copy Markdown
Member

@ywangd ywangd commented Jul 13, 2020

Getting the API key document form the security index is the most time consuing part of the authentication flow (>60% if index is local and >90% if index is remote). This traffic can be avoided with cache. Initially we also plan to avoid deserialising source of GetApiKeyDocument response if "primary term" and "sequence number" are unchanged. However, given the document fetching itself is dominant in execution time, it does not seem to be worthwhile to optimise for deserialisation. Hence, for uncached document, the cache will fetch the document and also simply deserialise it without further checking.

When a large number of API keys are in use, we expect most of them are created with exactly the same role-descriptors and limited-by-role-descriptors. It is wasteful to cache these identical role descriptors repeatitively for different API keys. Hence a two layer caches are put in place:

  • The first cache is keyed by docId and values are content of the API key document, except that the role-descriptors and limited-by-role-descriptors are replacecd by their sha256 hashes respectively.
  • The second cache is keyed by the above sha256 hashes and values are the corresponding role-descriptors.

The lifecyces of the above two caches are mostly tied together and they need work together in a way to ensure least possibility of stale results. Hence a new SecurityCaches class is added and it implements all necessary handlings as discussed in #55836.

@tvernum This is a quick attempt to implement a cache for API key document. It intentionally omits some parts, e.g. tests, configuration, to get a quick turnaround for proving the idea. It is easier to discuss when we have something concrete. We could try formalise it for v7.9 if feasible. But I am also OK to delay it if more careful evaluation is needed. Thanks!

Resolves: #53940

@ywangd ywangd added >enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v8.0.0 v7.9.0 labels Jul 13, 2020
@ywangd ywangd requested a review from tvernum July 13, 2020 00:30
@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Jul 13, 2020

Please note this PR is built upon the yet-to-be-approved #58156. Therefeore many changes are actually not relevant to this PR itself. Here are a list of actual relevant changes:

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Jul 13, 2020

@elasticmachine run elasticsearch-ci/1

@tvernum
Copy link
Copy Markdown
Contributor

tvernum commented Jul 13, 2020

I don't think it's worth pushing this in before FF.
It's a good change, but I think we need time to discuss things such the split doc/auth caches.

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Jul 13, 2020

I think we need time to discuss things such the split doc/auth caches.

Thanks, I agree. The changes are still rough. I also thought about merging doc and auth caches but didn't do it mainly because of complexity and potential implications of complex changes. It requires more time to get right.
Also a big part of the change is about refactoring the caching behaviour into a separate class, which probably deserves its own PR.

@ywangd ywangd force-pushed the es-53940-apikey-getdoc-avoidance branch from 8fa3869 to 32a7a31 Compare July 15, 2020 01:27
@ywangd ywangd force-pushed the es-53940-apikey-getdoc-avoidance branch from 32a7a31 to 7501b05 Compare July 20, 2020 06:09
@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Aug 11, 2020

I spent some time look into how we could have a single cache for both API key doc and Auth hash. I came up with a version that works. However, it comes with high complexity because the two parts of data have different lifecycles.

With the single cache approach, a cache item contains both content of an API key document and the Authentication hash. These two parts can vary independantly. That is, when the document needs to be refreshed for caching, the authentication hash can still be valid because it is based on user-provided credentials and only associated with the document ID, not its content. Similarly, when the authentication hash needs to be updated, the document can still be valid because unlike authentication hash it does not change based on user-provided credentials. So what we need is a cache where the cache item can be partially invalidated. This requirement is basically asking for two separate caches. It is possible to make it work with a single cache as mentioned earlier, but it leads to high complexity due to the diverging requirement plus how it could work with a distributed system. Therefore, even though I already have an implementation for it, I still prefer the approrach of separate caches. Honestly, I had fun caming up with my current implementaiton and trying to catch all edge cases. But I am concerned that it is difficult to explain how it works and maintain it a few years from now.

After discussion with @tvernum, we decided to go with separate caches. There is no overlap between contents of the two caches. Hence we do not have concern in terms of duplication (memory consumption).

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Aug 13, 2020

@tvernum After a few back and forth, here is the version that I am settled with. It touches the idea of consolidation for cache implementaitons, though with a much less ambitious approach. In addition to adding a cache for API key document, it also provides two enhancements:

  • A SecurityCache class to encapsulate best practicies around caching expiry and stale results. It is fairly simple in that it does not try to be a high level generic solution to all cache requirements, which is difficult and the complexity is not worth it (I had a version with 5 type parameters). Thus it requires a bit of manual work when using it. We can improve it down the path, but for now I think it is reasonable start point.
  • A SecurityCacheRegistry that centralise the caching clearing logic for Rest/Transport handlers. Right now it is just used for API key caches, including the new API key doc cache and the existing auth cache. But I can see it to replace the handlers for role cache and privilege cache and possibly realm cache as well. These can be done in a separate PR if the approach here looks good to you. This possibility is also in line with your previous comment on the privilege cache PR.

The PR still needs a few more things:

  • Tests
  • Client side ecode
  • Documentation

But I'd appreciate your review on the approach before I polish it.

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Sep 7, 2020

@tvernum I updated the PR based on your suggestion with some name tweaks. The changes are basically to simplify the SecurityCache, which in fact is now replaced by ConsistentCache. The ConsistentCache#checkpoint method returns a Checkpoint interface that can be used to cache given key/value pair while minimizing the possibility of caching stale results.

The SecurityCacheRegistry is renamed to CacheInvalidatorRegistry since that is all it does, i.e. providing cache invalidation services to different caches.

Could you please reivew the current approach? I'll proceeds with adding tests etc once we can agree on the overall approach. Thanks!

.setMaximumWeight(maximumWeight)
.build();
this.roleDescriptorsBytesCache = CacheBuilder.<String, BytesReference>builder()
.setExpireAfterAccess(ttl)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about this.
It defaults to 24h, and we don 't have much in place to clean it up.
Should it really cache for that long? It's expiry after access, so 1h feels like it would be sufficient.

From my brief reading, once the doc cache expires (5m after last write) there will be no incoming references to these entries. They may get revived, but I don't think holding onto that memory for 24h is justified.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to 1h. It was set to 24h because

  • 24h is an existing setting, so felt easy to reuse. I understand technically it has different meaning. But the subtlety here feels irrelevant to end-users.
  • Unlike API key document cache, the role descriptor cache, in theory, never "expires" in that they will never be outdated. Sure the entries may no longer be needed, but the association between a hash and its original content can never be invalid. Also many API keys may share the same role descriptors, for example, the superuser (limiting) role. These made me feel it is plausible to cache the descriptors for longer. Because even when a Key is expired or invalidated or haven't checked in for a while, it is likey a new key will be replacing it and the new key will have the exact same role descriptors and thus reuse the cache entries.

With above being said, I don't think the benefit of caching for longer is anyway significant, since the main performance hit is the document retrieval itself. So I am OK to have it at a shorter duration.

ywangd and others added 2 commits September 9, 2020 00:25
Co-authored-by: Tim Vernum <tim@adjective.org>
@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Oct 4, 2020

@albertzaharovits I updated this PR per your review and discussion. It is ready for another look. Thanks!

Copy link
Copy Markdown
Contributor

@tvernum tvernum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I have a few suggestions (some of which are strong suggestions).

super.writeTo(out);
out.writeString(cacheName);
out.writeOptionalStringArray(keys);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would have serialization tests for every class that has a writeTo method.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a serialisation test.

final CacheInvalidator cacheInvalidator = cacheInvalidators.get(cacheName);
if (cacheInvalidator != null) {
cacheInvalidator.invalidate(keys);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lack of an else seems like dangerous leniency. Why do we silently pretend to invalidate a cache that doesn't exist?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point and there is no reason for this leniency. Add an else branch to throw exception and also relevant tests.

ywangd and others added 3 commits October 6, 2020 14:47
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: Tim Vernum <tim@adjective.org>
Copy link
Copy Markdown
Contributor

@albertzaharovits albertzaharovits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Good efforts on this one!

@ywangd ywangd merged commit 9e1f912 into elastic:master Oct 6, 2020
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Oct 6, 2020
)

Getting the API key document form the security index is the most time consuing part
of the API Key authentication flow (>60% if index is local and >90% if index is remote).
This traffic is now avoided by caching added with this PR.

Additionally, we add a cache invalidator registry so that clearing of different caches will
be managed in a single place (requires follow-up PRs).
ywangd added a commit that referenced this pull request Oct 6, 2020
…63319)

Getting the API key document form the security index is the most time consuing part
of the API Key authentication flow (>60% if index is local and >90% if index is remote).
This traffic is now avoided by caching added with this PR.

Additionally, we add a cache invalidator registry so that clearing of different caches will
be managed in a single place (requires follow-up PRs).
ywangd added a commit that referenced this pull request Apr 18, 2021
This PR is a follow-up refactoring for #59376. It replaces
InvalidationCountingCacheWrapper with a simpler and more resuable
LockingAtomicCounter.

The change includes extracting the logic of "minimizing chance of caching stale
results" into a new class, which is now generic and not tied to any cache
related operations. This in turn allows it to be reused by different caches,
which often have subtle, important and anonying differences that make it
impossible (or at least cubersome and painful) to have a generic solution. So
this change of reducing the size of reusable code is the right move.

As an example of applying this new pattern, the caches for NativePrivilegeStore
are now migrated and managed by the centrialied CacheInvalidatorRegistry. The
clear privilege cache rest and transport actions are also updated to use the
new implementation. The Request/Response classes are intentionally untouched
and plan to have a separate PR to address them because they would involve user
visible deprecation changes.

Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Apr 19, 2021
This PR is a follow-up refactoring for elastic#59376. It replaces
InvalidationCountingCacheWrapper with a simpler and more resuable
LockingAtomicCounter.

The change includes extracting the logic of "minimizing chance of caching stale
results" into a new class, which is now generic and not tied to any cache
related operations. This in turn allows it to be reused by different caches,
which often have subtle, important and anonying differences that make it
impossible (or at least cubersome and painful) to have a generic solution. So
this change of reducing the size of reusable code is the right move.

As an example of applying this new pattern, the caches for NativePrivilegeStore
are now migrated and managed by the centrialied CacheInvalidatorRegistry. The
clear privilege cache rest and transport actions are also updated to use the
new implementation. The Request/Response classes are intentionally untouched
and plan to have a separate PR to address them because they would involve user
visible deprecation changes.

Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>
ywangd added a commit that referenced this pull request Apr 19, 2021
This PR is a follow-up refactoring for #59376. It replaces
InvalidationCountingCacheWrapper with a simpler and more resuable
LockingAtomicCounter.

The change includes extracting the logic of "minimizing chance of caching stale
results" into a new class, which is now generic and not tied to any cache
related operations. This in turn allows it to be reused by different caches,
which often have subtle, important and anonying differences that make it
impossible (or at least cubersome and painful) to have a generic solution. So
this change of reducing the size of reusable code is the right move.

As an example of applying this new pattern, the caches for NativePrivilegeStore
are now migrated and managed by the centrialied CacheInvalidatorRegistry. The
clear privilege cache rest and transport actions are also updated to use the
new implementation. The Request/Response classes are intentionally untouched
and plan to have a separate PR to address them because they would involve user
visible deprecation changes.

Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) Team:Security Meta label for security team v7.10.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid deserialising source of the get API key response when there is no change

5 participants