Fix concurrent refresh of tokens#55114
Conversation
Our handling for cncurrent refresh of access tokens suffered from a race condition where: 1. Thread A has just finished with updating the existing token document, but hasn't stored the new tokens in a new document yet 2. Thread B attempts to refresh the same token and since the original token document is marked as refreshed, it decrypts and gets the new access token and refresh token and returns that to the caller of the API. 3. The caller attempts to use the newly refreshed access token immediately and gets an authentcation error since Thread a still hasn't finished writing the document. This commit changes the behavior so that Thread B, would first try to do a Get request for the token document where it expects that the access token it decrypted is stored(with exponential backoff) and will not respond until it can verify that it reads it in the tokens index. That ensures that we only ever return tokens in a response if they are already valid and can be used immediately It also adjusts TokenAuthIntegTests#testRefreshingMultipleTimesWithinWindowSucceeds to test authenticating with the tokens each thread receives, which fails without the fix. Resolves: elastic#54289
|
Pinging @elastic/es-security (:Security/Authentication) |
ywangd
left a comment
There was a problem hiding this comment.
A few minor comments. Otherwise LGTM
| if (backoff.hasNext()) { | ||
| logger.info("could not get token document [{}] that should have been created, retrying", tokenDocId); | ||
| client.threadPool().schedule(() -> getTokenDocAsync(tokenDocId, tokensIndex, this), | ||
| backoff.next(), GENERIC); | ||
| } else { | ||
| logger.warn("could not get token document [{}] that should have been created after all retries", | ||
| tokenDocId); | ||
| onFailure.accept(invalidGrantException("could not refresh the requested token")); | ||
| } |
There was a problem hiding this comment.
Maybe extract this logic into a consumer? It can be reused for the onReponse part as well.
| assertThat(failed.get(), equalTo(false)); | ||
| // Assert that we only ever got one access_token/refresh_token pair | ||
| assertThat(tokens.stream().distinct().collect(Collectors.toList()).size(), equalTo(1)); | ||
| // Assert that we only ever got one anot(contccess_token/refresh_token pair |
There was a problem hiding this comment.
anot(contccess_token - probably some random extra keystrokes?
| assertThat((int) tokens.stream().distinct().count(), equalTo(1)); | ||
| // Assert that all requests from all threads could authenticate at the time they received the access token | ||
| // see: https://github.com/elastic/elasticsearch/issues/54289 | ||
| assertThat(authStatuses, not(hasItem(RestStatus.UNAUTHORIZED))); |
There was a problem hiding this comment.
nits: alternatively, maybe we could verify all statuses are OK since getAuthenticationResponseCode always returns OK when no exception is thrown. It feels more precise.
There was a problem hiding this comment.
I was thinking in terms of testing for the bug this is solving but these all should be OK so I see your point, will address
|
Ping @tvernum |
| onFailure.accept(invalidGrantException("could not refresh the requested token")); | ||
| } | ||
| }; | ||
| getTokenDocAsync(tokenDocId, tokensIndex, new ActionListener<>() { |
There was a problem hiding this comment.
I think this, and the call above should explicitly not fetch the source since we only care about an exists check.
There was a problem hiding this comment.
Yes, agreed ! Thanks
Our handling for concurrent refresh of access tokens suffered from a race condition where: 1. Thread A has just finished with updating the existing token document, but hasn't stored the new tokens in a new document yet 2. Thread B attempts to refresh the same token and since the original token document is marked as refreshed, it decrypts and gets the new access token and refresh token and returns that to the caller of the API. 3. The caller attempts to use the newly refreshed access token immediately and gets an authentication error since thread A still hasn't finished writing the document. This commit changes the behavior so that Thread B, would first try to do a Get request for the token document where it expects that the access token it decrypted is stored(with exponential backoff ) and will not respond until it can verify that it reads it in the tokens index. That ensures that we only ever return tokens in a response if they are already valid and can be used immediately It also adjusts TokenAuthIntegTests to test authenticating with the tokens each thread receives, which would fail without the fix. Resolves: elastic#54289
Our handling for concurrent refresh of access tokens suffered from a race condition where: 1. Thread A has just finished with updating the existing token document, but hasn't stored the new tokens in a new document yet 2. Thread B attempts to refresh the same token and since the original token document is marked as refreshed, it decrypts and gets the new access token and refresh token and returns that to the caller of the API. 3. The caller attempts to use the newly refreshed access token immediately and gets an authentication error since thread A still hasn't finished writing the document. This commit changes the behavior so that Thread B, would first try to do a Get request for the token document where it expects that the access token it decrypted is stored(with exponential backoff ) and will not respond until it can verify that it reads it in the tokens index. That ensures that we only ever return tokens in a response if they are already valid and can be used immediately It also adjusts TokenAuthIntegTests to test authenticating with the tokens each thread receives, which would fail without the fix. Resolves: #54289
Our handling for concurrent refresh of access tokens suffered from a race condition where: 1. Thread A has just finished with updating the existing token document, but hasn't stored the new tokens in a new document yet 2. Thread B attempts to refresh the same token and since the original token document is marked as refreshed, it decrypts and gets the new access token and refresh token and returns that to the caller of the API. 3. The caller attempts to use the newly refreshed access token immediately and gets an authentication error since thread A still hasn't finished writing the document. This commit changes the behavior so that Thread B, would first try to do a Get request for the token document where it expects that the access token it decrypted is stored(with exponential backoff ) and will not respond until it can verify that it reads it in the tokens index. That ensures that we only ever return tokens in a response if they are already valid and can be used immediately It also adjusts TokenAuthIntegTests to test authenticating with the tokens each thread receives, which would fail without the fix. Resolves: elastic#54289
Our handling for concurrent refresh of access tokens suffered from a race condition where: 1. Thread A has just finished with updating the existing token document, but hasn't stored the new tokens in a new document yet 2. Thread B attempts to refresh the same token and since the original token document is marked as refreshed, it decrypts and gets the new access token and refresh token and returns that to the caller of the API. 3. The caller attempts to use the newly refreshed access token immediately and gets an authentication error since thread A still hasn't finished writing the document. This commit changes the behavior so that Thread B, would first try to do a Get request for the token document where it expects that the access token it decrypted is stored(with exponential backoff ) and will not respond until it can verify that it reads it in the tokens index. That ensures that we only ever return tokens in a response if they are already valid and can be used immediately It also adjusts TokenAuthIntegTests to test authenticating with the tokens each thread receives, which would fail without the fix. Resolves: #54289
@ywangd made an awesome analysis on why this test is failing, over at elastic#55816 (comment) This change makes it so that we use the same client to perform a refresh of a token, as we use to subsequently attempt to authenticate with the refreshed token. This ensures the tests are failing and is a good approximation of how we expect the same client doing the refresh, to also perform the subsequent authentication in real life uses. The errors we were seeing from users have dissappeared after elastic#55114 so we deem our behavior safe.
@ywangd made an awesome analysis on why this test is failing, over at elastic#55816 (comment) This change makes it so that we use the same client to perform a refresh of a token, as we use to subsequently attempt to authenticate with the refreshed token. This ensures the tests are failing and is a good approximation of how we expect the same client doing the refresh, to also perform the subsequent authentication in real life uses. The errors we were seeing from users have disappeared after elastic#55114 so we deem our behavior safe.
@ywangd made an awesome analysis on why this test is failing, over at #55816 (comment) This change makes it so that we use the same client to perform a refresh of a token, as we use to subsequently attempt to authenticate with the refreshed token. This ensures the tests are failing and is a good approximation of how we expect the same client doing the refresh, to also perform the subsequent authentication in real life uses. The errors we were seeing from users have dissappeared after #55114 so we deem our behavior safe.
@ywangd made an awesome analysis on why this test is failing, over at #55816 (comment) This change makes it so that we use the same client to perform a refresh of a token, as we use to subsequently attempt to authenticate with the refreshed token. This ensures the tests are failing and is a good approximation of how we expect the same client doing the refresh, to also perform the subsequent authentication in real life uses. The errors we were seeing from users have disappeared after #55114 so we deem our behavior safe.
Our handling for concurrent refresh of access tokens suffered from
a race condition where:
document, but hasn't stored the new tokens in a new document
yet
original token document is marked as refreshed, it decrypts and
gets the new access token and refresh token and returns that to
the caller of the API.
immediately and gets an authentication error since Thread a still
hasn't finished writing the document.
This commit changes the behavior so that Thread B, would first try
to do a Get request for the token document where it expects that
the access token it decrypted is stored(with exponential backoff)
and will not respond until it can verify that it reads it in the
tokens index. That ensures that we only ever return tokens in a
response if they are already valid and can be used immediately
It also adjusts
TokenAuthIntegTests#testRefreshingMultipleTimesWithinWindowSucceeds
to test authenticating with the tokens each thread receives,
which fails without the fix.
Resolves: #54289