
Re-build remote cluster connections on credential changes#103460

Merged
elasticsearchmachine merged 76 commits into elastic:main from n1v0lg:rcs2-rebuild-connections
Jan 11, 2024

Conversation

@n1v0lg
Contributor

@n1v0lg n1v0lg commented Dec 14, 2023

This PR builds on #102798 by adding automatic remote connection rebuilding on cluster credentials changes. In particular, we rebuild a remote cluster connection if a credential for the associated cluster is newly added (i.e., we are moving from RCS 1.0 -> RCS 2.0) or removed (moving from RCS 2.0 -> RCS 1.0). A connection rebuild allows us to associate the correct profile (_remote_server in case of RCS 2.0, or "regular" transport profile for RCS 1.0) without requiring end-users to manually update remote cluster settings via a settings update call.
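The rebuild trigger described above — diffing the old and new credential sets and rebuilding any connection whose credential was added or removed — can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual Elasticsearch code; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a connection must be rebuilt when a credential for its
// alias is newly added (RCS 1.0 -> 2.0) or removed (RCS 2.0 -> 1.0).
public class CredentialDiff {
    public static Set<String> aliasesToRebuild(Map<String, String> oldCreds, Map<String, String> newCreds) {
        Set<String> changed = new HashSet<>();
        for (String alias : oldCreds.keySet()) {
            if (!newCreds.containsKey(alias)) {
                changed.add(alias); // credential removed: fall back to the regular RCS 1.0 transport profile
            }
        }
        for (String alias : newCreds.keySet()) {
            if (!oldCreds.containsKey(alias)) {
                changed.add(alias); // credential added: switch to the _remote_server profile
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Set<String> result = aliasesToRebuild(
            Map.of("a", "key1", "b", "key2"),
            Map.of("b", "key2", "c", "key3")
        );
        System.out.println(result); // "a" was removed, "c" was added, "b" is unchanged
    }
}
```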

More context on connection rebuilding also in this comment.

Relates: ES-6764

n1v0lg and others added 30 commits December 7, 2023 15:43
…elastic/elasticsearch into revert-103211-revert-102798-rcs2-reload
Contributor

@jakelandis jakelandis left a comment


I haven't reviewed the tests yet, but looking good ... a couple comments.

// We avoid stashing and marking context as system to keep the action as minimal as possible (i.e., avoid copying context)
remoteClusterService.updateRemoteClusterCredentials(request.getSettings());
listener.onResponse(ActionResponse.Empty.INSTANCE);
assert Transports.assertNotTransportThread("Remote connection re-building is too much for a transport thread");
Contributor


nice ! TIL

);
assert future.isDone() : "expecting local-only action call to return immediately on invocation";
future.actionGet(0, TimeUnit.NANOSECONDS);
future.actionGet(10, TimeUnit.SECONDS);
Contributor


I'm not sure we want a timeout here. If the timeout is hit, it will not cancel the underlying work, and other than returning sooner I'm not sure what it buys us. It only encourages calling the API again, which can compound the already unhealthy scenario (and we will never be able to guess a correct upper bound unless it is huge). If anything, the per-connection call could have a timeout/retry, but if it does not already, I don't see a need to add one now. I.e., use the explicit listener variant instead of the implicit listener used in the future to make the call to the action.
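The suggested shape — pass the caller's listener through instead of blocking on a future with a guessed timeout — looks roughly like this in plain java.util.concurrent terms. CompletableFuture stands in for PlainActionFuture here; the names are illustrative, not the Elasticsearch API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ListenerVsFuture {
    // Blocking variant: the caller must pick an arbitrary upper bound, and
    // hitting it does not cancel the underlying rebuild work.
    static String blockingCall(CompletableFuture<String> work) throws Exception {
        return work.get(10, TimeUnit.SECONDS); // guessed timeout
    }

    // Listener variant: completion (or failure) propagates whenever the
    // rebuild actually finishes; no guessed timeout is needed.
    static void listenerCall(CompletableFuture<String> work,
                             Consumer<String> onResponse,
                             Consumer<Throwable> onFailure) {
        work.whenComplete((result, err) -> {
            if (err != null) onFailure.accept(err); else onResponse.accept(result);
        });
    }

    public static void main(String[] args) {
        CompletableFuture<String> work = new CompletableFuture<>();
        listenerCall(work, r -> System.out.println("done: " + r), e -> System.out.println("failed"));
        work.complete("rebuilt"); // prints "done: rebuilt" once the work completes
    }
}
```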

return;
}

final PlainActionFuture<ActionResponse.Empty> future = new PlainActionFuture<>();
Contributor


Can we short circuit this entirely if the credentials did not change ?

Contributor Author


It's tricky: the mere absence of credentials in the settings doesn't mean a noop (it could be that we removed credentials) -- anything beyond that would require access to the remote cluster service, which is not available here. I originally wanted to add a short-circuit here as well but didn't come up with a good way to do it.

GroupedActionListener<Void> groupedListener
) {
if (remoteClusters.containsKey(clusterAlias)) {
updateRemoteCluster(clusterAlias, settings, true, ActionListener.wrap(status -> {
Contributor


does this handle tearing down a connection after removing a credential ?

Contributor Author


If a connection exists and we remove a credential, updateRemoteCluster handles re-building it. If the connection does not exist, that means the cluster settings for it were removed, and updateRemoteCluster will have torn it down on the regular settings update call that removed those settings. So if a connection no longer exists but a credential for it does, and that credential is then removed, there is nothing left to tear down. Let me know if I misinterpreted the question!

Contributor

@jfreden jfreden left a comment


Very solid work on this PR! It looks great. Easy to follow, good tests, and a lot of thought has been put into covering corner cases. I only have some optional comments.

One thing I couldn't really wrap my head around is if the reload of credentials is graceful or not? What happens to ongoing CCR and CCS requests when the credentials are reloaded? Are they just going to fail until the new connection is up? Maybe that's fine.

final GroupedActionListener<Boolean> groupedListener = new GroupedActionListener<>(
totalConnectionsToRebuild,
listener.map(successFlags -> {
logger.info("rebuild complete for [{}] connections after credentials update", successFlags.size());
Contributor


Cool! I like the GroupedActionListener.
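For readers unfamiliar with it, GroupedActionListener fans in N per-connection results and fires the wrapped listener exactly once, after the last result arrives. A stripped-down sketch of the pattern (not the real Elasticsearch class):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Minimal fan-in listener: collects `expected` results, then invokes the
// delegate exactly once with all of them.
public class MiniGroupedListener<T> {
    private final Consumer<List<T>> delegate;
    private final List<T> results = new CopyOnWriteArrayList<>();
    private final AtomicInteger remaining;

    public MiniGroupedListener(int expected, Consumer<List<T>> delegate) {
        this.delegate = delegate;
        this.remaining = new AtomicInteger(expected);
    }

    public void onResponse(T result) {
        results.add(result);
        if (remaining.decrementAndGet() == 0) {
            delegate.accept(results); // all connection rebuilds have reported in
        }
    }

    public static void main(String[] args) {
        MiniGroupedListener<Boolean> grouped = new MiniGroupedListener<>(3,
            flags -> System.out.println("rebuild complete for [" + flags.size() + "] connections"));
        grouped.onResponse(true);
        grouped.onResponse(true);
        grouped.onResponse(false); // prints: rebuild complete for [3] connections
    }
}
```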

Strings.collectionToCommaDelimitedString(clusterCredentials.keySet())
)
);
public final synchronized UpdateRemoteClusterCredentialsResult updateClusterCredentials(Settings settings) {
Contributor


This is only called from a single-threaded context. Might consider keeping it synchronized for future proofing, but it adds some overhead to acquire an extra lock.

Contributor Author


Yes great point. I meant to take a closer look at being less heavy handed with synchronization but didn't get to it. Thanks for prompting me to think about this again. I agree that with the current structure, we can remove synchronization here. I'll think on it and tweak some things 👍

Contributor Author


Discussed this on Slack also:

  • I prefer to keep this one synchronized for future proofing
  • I removed sorting the cluster aliases, as it adds unnecessary overhead without utility -- originally, I sorted with the notion of preventing deadlocks in RemoteClusterService::updateRemoteClusterCredentials (since we iterate over aliases and call a synchronized method for each, there is a potential deadlock). However, updateRemoteClusterCredentials is itself synchronized at the top level, so we're safe.

Thanks for raising and discussing this 👍

remoteClusters.put(clusterAlias, remote);
remote.ensureConnected(listener.map(ignored -> RemoteClusterConnectionStatus.CONNECTED));
} else if (remote.shouldRebuildConnection(newSettings)) {
} else if (forceRebuild || remote.shouldRebuildConnection(newSettings)) {
Contributor


nit: Since we know that this will always be true when we reload credentials, it might make sense to break this in to its own method to avoid the boolean in the method signature.

Contributor Author


Agreed that the flag is not ideal, but having stared at the method for a bit, I don't see a good way to refactor it without basically duplicating it or ending up with some sort of boolean flag anyway -- did you have a specific tweak in mind? Partly, I wanted to avoid touching too much of the connection rebuilding code since it's fairly complex and outside of the security team's scope.

Contributor


Makes sense! It's fairly complex so I think it's out of scope to refactor any of that. 👍

@@ -20,18 +22,80 @@ public void testResolveRemoteClusterCredentials() {
final String clusterAlias = randomAlphaOfLength(9);
final String otherClusterAlias = randomAlphaOfLength(10);

Contributor


If the CredentialsManager is expected to be thread safe, a concurrency test could be useful in this class.

Contributor Author


I didn't get around to this yet -- I think a concurrency test for RemoteClusterService::updateRemoteClusterCredentials would give us the most useful coverage. I think it's a really nice-to-add test but don't want to defer another round of review for it. My plan is to add it still.
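A concurrency test along the lines discussed might take a shape like the following. This is a generic sketch under assumed names (a plain ConcurrentHashMap stands in for the credential state; the real test would exercise RemoteClusterService::updateRemoteClusterCredentials), using a CyclicBarrier so all updater threads overlap as much as possible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;

public class ConcurrentUpdateSketch {
    // Fire `threads` credential updates at (almost) the same instant and
    // return the resulting credential map.
    static Map<String, String> runConcurrentUpdates(int threads) throws InterruptedException {
        Map<String, String> credentials = new ConcurrentHashMap<>();
        CyclicBarrier barrier = new CyclicBarrier(threads); // maximize overlap between updaters
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                try {
                    barrier.await(); // all threads proceed together
                    credentials.put("cluster_" + id, "secret_" + id);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) t.join();
        return credentials;
    }

    public static void main(String[] args) throws InterruptedException {
        // Every update must survive the concurrent run.
        System.out.println(runConcurrentUpdates(8).size() == 8);
    }
}
```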

Contributor Author


I'll log a follow up item to address this one in Jira to not block getting this out the door.


import static org.hamcrest.Matchers.containsInAnyOrder;

// account for slow stored secure settings updates (involves removing and re-creating the keystore)
Contributor


Would be interesting with a test case that blasts the API with concurrent reload requests combined with config updates.

Contributor Author


Agreed. It wouldn't work as a REST-level test because of the keystore file rewrite, but I'll see if there is something lower-level we can do.

@n1v0lg
Contributor Author

n1v0lg commented Dec 22, 2023

@jfreden thanks for the review!

One thing I couldn't really wrap my head around is if the reload of credentials is graceful or not? What happens to ongoing CCR and CCS requests when the credentials are reloaded? Are they just going to fail until the new connection is up?

Not very graceful, indeed, and it may cause CCS/CCR to fail. This is existing behavior, in that you can also trigger a connection rebuild by updating remote cluster settings. To avoid these sorts of failures, our migration guide recommends stopping long-running cross-cluster operations: https://www.elastic.co/guide/en/elasticsearch/reference/current/remote-clusters-migrate.html#remote-clusters-migration-stop

try {
final PlainActionFuture<RemoteClusterConnectionStatus> future = new PlainActionFuture<>();
updateRemoteCluster(clusterAlias, settings, true, future);
final RemoteClusterConnectionStatus status = future.actionGet(10, TimeUnit.SECONDS);
Member


I do not like stacking the timeouts one after the other like this. If we have, say, 150 remotes, and they all hit the 10s timeout, then with this logic we'll be waiting for 1500s = 25 minutes. If we need to time out at all then we should have one top-level timeout, and TBH I'd rather it was in the caller seeing as how we don't do anything (except logging) when we hit the timeout anyway.

I know we have this problem when applying cluster states too, and IMO that's a bug. It's tricky to solve in that case because of the mechanics of how settings are applied, but I would rather avoid making the same mistake here too.

Contributor Author


Fair point. I don't feel strongly about the per-connection timeout (nor the top-level timeout for that matter).

If we have 150 remotes and each connection takes >10s to rebuild, the overall request will still wait for 25m. A top-level timeout could prevent this, but it's difficult to pick a reasonable default, since I expect rebuild times to vary greatly based on the network situation for a given connection (@jakelandis points this out here, which prompted me to consider a per-connection timeout) -- e.g., a healthy connection will rebuild in under a second, but a slow/unstable network could result in long waits.

It could be nice to add a timeout parameter to the initial, generic reload-secure-settings request but we have no way of passing this down to the individual plugin reload call, since the reload() call only takes settings as input and I would prefer not to change the ReloadablePlugin interface here since that's a bigger change that requires separate input and deliberation.

The only other alternative that comes to mind is adding a dedicated setting where end-users can set a connection rebuild timeout but that feels hacky and like an overkill since, as you point out, we don't do anything (except logging) when we hit the timeout anyway.

Given that it's hard to come up with a generic top-level timeout, and that individual connection timeouts can give us very high (and pretty arbitrary) wait times, I prefer to remove the timeout altogether for now and possibly revisit this in the future. @DaveCTurner @jakelandis let me know if you have objections.

Member


I would be happy with no timeout here I think, especially if we trigger all the connection attempts in parallel since there's other timeouts in play lower down the stack, preventing this from really waiting forever.

I do think it's worth reconsidering the synchronous nature of the reload() API, which forces us to call each plugin in turn (or do something worse like spawning a thread per plugin). Stuff like this should be async all the way down the stack IMO and then these concerns wouldn't really arise.

@n1v0lg n1v0lg requested a review from jakelandis January 8, 2024 17:11
Contributor

@jakelandis jakelandis left a comment


LGTM, nice work !


public void testUpgradeFromRcs1() throws Exception {
// Setup RCS 1.0 and check that it works
configureRemoteCluster("my_remote_cluster", fulfillingCluster, true, randomBoolean(), randomBoolean());
Contributor


nit: would it be possible to assert that no API key is configured for my_remote_cluster? Just a bit paranoid about ensuring this is really RCS 1.0. If it is anything but trivial, feel free to ignore this.

Contributor Author


configureRemoteCluster calls checkRemoteConnection under the hood which ensures that no credential is configured when basicSecurity is true here

Checking the keystore would be a bit tricky, off the top of my head. Hope this is sufficient!

assumeFalse(
"Cannot run in FIPS mode since the keystore will be password protected and sending a password in the reload"
+ "settings api call, requires TLS to be configured for the transport layer",
inFipsJvm()
Contributor


requires TLS to be configured for the transport layer

where is this enforced? Also, isn't TLS enabled for the transport layer in these tests?

Contributor


Also, could you log an issue for follow up. Ideally we wouldn't skip these in FIPS mode.

Contributor Author

@n1v0lg n1v0lg Jan 10, 2024


Ah good catch! The message is a copy-paste hoopla. FIPS is still a problem though, around insufficient password length for the keystore password. I'll tweak the message and log a follow up to address this.

@jakelandis jakelandis added the :Security/FIPS Running ES in FIPS 140-2 mode label Jan 10, 2024
@n1v0lg
Contributor Author

n1v0lg commented Jan 11, 2024

@elasticmachine update branch

@n1v0lg n1v0lg added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 11, 2024
@n1v0lg
Contributor Author

n1v0lg commented Jan 11, 2024

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit f84bda7 into elastic:main Jan 11, 2024
@n1v0lg n1v0lg deleted the rcs2-rebuild-connections branch January 11, 2024 12:39

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >non-issue :Security/FIPS Running ES in FIPS 140-2 mode :Security/Security Security issues without another label Team:Security Meta label for security team test-update-serverless v8.13.0
