Timing out stale remote master history#86936
Conversation
|
Pinging @elastic/es-data-management (Team:Data Management) |
andreidan
left a comment
There was a problem hiding this comment.
Thanks for working on this Keith.
This generally LGTM - just left a few questions
| */ | ||
| public static final Setting<TimeValue> MAX_USABLE_REMOTE_HISTORY_AGE_SETTING = Setting.timeSetting( | ||
| "master_history.max_usable_remote_history_age", | ||
| DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE, |
There was a problem hiding this comment.
Should this mandate a min-value?
There was a problem hiding this comment.
What do you think it ought to be? I thought briefly about it, and then thought maybe there would be some case where we would want to tell a user to set it to 0 or negative to basically skip checking remote master history (probably not, but we might be kicking ourselves later if we put in some arbitrary minimum).
There was a problem hiding this comment.
Shall we use positiveTimeSetting to at least not worry about negative values?
| volatile RemoteHistoryOrException remoteHistoryOrException = new RemoteHistoryOrException(null, null, Long.MIN_VALUE); | ||
| private static final Logger logger = LogManager.getLogger(MasterHistoryService.class); | ||
|
|
||
| private static final TimeValue DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE = new TimeValue(5, TimeUnit.MINUTES); |
There was a problem hiding this comment.
Is "usable" a bit redundant in the name? Would going with "time to live" be a bit more intuitive? (given the already established understanding in caching technologies) e.g. "remote_history_time_to_live" ?
| */ | ||
| long acceptableRemoteHistoryTime = currentTimeMillisSupplier.getAsLong() - acceptableRemoteHistoryAge.getMillis(); | ||
| if (remoteHistoryOrExceptionCopy.creationTimestamp < acceptableRemoteHistoryTime) { | ||
| return null; |
There was a problem hiding this comment.
Would we want to also at some point null-ify remoteHistoryOrException ? (we'd likely need an AtomicReference or such to do a compare and swap)
What do you think?
There was a problem hiding this comment.
My thinking was that there was no harm in holding onto a single one, and it avoids having to synchronize access to it. Synchronizing it would not be super expensive (we'd need to read it, do the time calculation, and set it to null in an atomic action), but it seemed like an unnecessary complication. Is it the memory that you're concerned about? We cap these things to have at most 50 entries, so I'd expect at most a few KB here.
There was a problem hiding this comment.
I was mostly worried about the remoteHistoryOrException leaking outside the service by other means than getRemoteMasterHistory or the service evolving and using the remoteHistoryOrException in a private way (assuming it's null when it's stale, which would not be a wild assumption given what the getter returns) - @DaveCTurner would you have a strong opinion here?
There was a problem hiding this comment.
That second concern would maybe mean having a thread to time this thing out. That seems like something we ought to deal with if we really need it in the future, but seems overly complex for the current situation doesn't it?
There was a problem hiding this comment.
I am inclined to leave it in place too. You never know, it might even end up being useful in a heap dump. Maybe add a comment to the field indicating that it might be stale.
For future reference it wouldn't be a big deal to clear it after the timeout either:
transportService.getThreadPool().schedule(() -> {/* clear field*/}, acceptableRemoteHistoryAge, Names.SAME);
(also there's no need for a ThreadPool constructor argument if you have a TransportService)
There was a problem hiding this comment.
++ let's leave it as is then
|
@elasticmachine run elasticsearch-ci/part-1 |
|
|
||
| record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception) { // non-private for testing | ||
| // non-private for testing | ||
| record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) { |
There was a problem hiding this comment.
| record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) { | |
| record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimeMillis) { |
andreidan
left a comment
There was a problem hiding this comment.
LGTM, with a minor rename suggestion
Thanks Keith
|
|
||
| RemoteHistoryOrException(List<DiscoveryNode> remoteHistory) { | ||
| this(remoteHistory, null); | ||
| RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, long creationTimestamp) { |
There was a problem hiding this comment.
Can we rename the creationTimestamp to include the unit everywhere? ( https://github.com/elastic/elasticsearch/pull/86936/files#r880273628 )
|
@elasticmachine run elasticsearch-ci/part-1 |
|
@elasticmachine run elasticsearch-ci/part-2 |
This change makes it so that the remote master history that has been fetched will time out after 5 minutes (by default). The MasterHistoryService will return its value as
nullif it is older than the configured timeout. This way MasterHistoryService clients don't report status based on out of date information.