Skip to content

Timing out stale remote master history#86936

Merged
masseyke merged 9 commits intoelastic:masterfrom
masseyke:fix/age-out-remote-master-history
May 24, 2022
Merged

Timing out stale remote master history#86936
masseyke merged 9 commits intoelastic:masterfrom
masseyke:fix/age-out-remote-master-history

Conversation

@masseyke
Copy link
Copy Markdown
Member

This change makes it so that the remote master history that has been fetched will time out after 5 minutes (by default). The MasterHistoryService will return its value as null if it is older than the configured timeout. This way MasterHistoryService clients don't report status based on out of date information.

@masseyke masseyke added >non-issue :Distributed/Health Issues for the health report API v8.3.0 labels May 19, 2022
@masseyke masseyke marked this pull request as ready for review May 19, 2022 18:56
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label May 19, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

This generally LGTM - just left a few questions

*/
public static final Setting<TimeValue> MAX_USABLE_REMOTE_HISTORY_AGE_SETTING = Setting.timeSetting(
"master_history.max_usable_remote_history_age",
DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this mandate a min-value?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think it ought to be? I thought briefly about it, and then thought maybe there would be some case where we would want to tell a user to set it to 0 or negative to basically skip checking remote master history (probably not, but we might be kicking ourselves later if we put in some arbitrary minimum).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use positiveTimeSetting to at least not worry about negative values?

volatile RemoteHistoryOrException remoteHistoryOrException = new RemoteHistoryOrException(null, null, Long.MIN_VALUE);
private static final Logger logger = LogManager.getLogger(MasterHistoryService.class);

private static final TimeValue DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE = new TimeValue(5, TimeUnit.MINUTES);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "usable" a bit redundant in the name? Would going with "time to live" be a bit more intuitive? (given the already established understanding in caching technologies) e.g. "remote_history_time_to_live" ?

*/
long acceptableRemoteHistoryTime = currentTimeMillisSupplier.getAsLong() - acceptableRemoteHistoryAge.getMillis();
if (remoteHistoryOrExceptionCopy.creationTimestamp < acceptableRemoteHistoryTime) {
return null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to also at some point null-ify remoteHistoryOrException ? (we'd likely need an AtomicReference or such to do a compare and swap)

What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that there was no harm in holding onto a single one, and it avoids having to synchronize access to it. Synchronizing it would not be super expensive (we'd need to read it, do the time calculation, and set it to null in an atomic action), but it seemed like an unnecessary complication. Is it the memory that you're concerned about? We cap these things to have at most 50 entries, so I'd expect at most a few KB here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mostly worried about the remoteHistoryOrException leaking outside the service by other means than getRemoteMasterHistory or the service evolving and using the remoteHistoryOrException in a private way (assuming it's null when it's stale, which would not be a wild assumption given what the getter returns) - @DaveCTurner would you have a strong opinion here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That second concern would maybe mean having a thread to time this thing out. That seems like something we ought to deal with if we really need it in the future, but seems overly complex for the current situation doesn't it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to leave it in place too. You never know, it might even end up being useful in a heap dump. Maybe add a comment to the field indicating that it might be stale.

For future reference it wouldn't be a big deal to clear it after the timeout either:

        transportService.getThreadPool().schedule(() -> {/* clear field*/}, acceptableRemoteHistoryAge, Names.SAME);

(also there's no need for a ThreadPool constructor argument if you have a TransportService)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ let's leave it as is then

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine run elasticsearch-ci/part-1

@masseyke masseyke requested a review from andreidan May 23, 2022 18:15

record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception) { // non-private for testing
// non-private for testing
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) {
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimeMillis) {

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a minor rename suggestion

Thanks Keith


RemoteHistoryOrException(List<DiscoveryNode> remoteHistory) {
this(remoteHistory, null);
RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, long creationTimestamp) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the creationTimestamp to include the unit everywhere? ( https://github.com/elastic/elasticsearch/pull/86936/files#r880273628 )

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine run elasticsearch-ci/part-1

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine run elasticsearch-ci/part-2

@masseyke masseyke merged commit 3630a14 into elastic:master May 24, 2022
@masseyke masseyke deleted the fix/age-out-remote-master-history branch May 24, 2022 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants