Use lastSyncedGlobalCheckpoint in deletion policy#27826
Use lastSyncedGlobalCheckpoint in deletion policy#27826dnhatn merged 6 commits intoelastic:masterfrom
Conversation
Today we use the in-memory global checkpoint from SequenceNumbersService to clean up unneeded commit points, however the latest global checkpoint may haven't fsynced to the disk yet. If the translog checkpoint fsync failed and we already use a higher global checkpoint to clean up commit points, then we may have removed a safe commit which we try to keep for recovery. This commit updates the deletion policy using lastSyncedGlobalCheckpoint from Translog rather the in memory global checkpoint.
bleskes
left a comment
There was a problem hiding this comment.
Thx Nhat. I left some comments
| Channels.readFromFileChannelWithEofException(channel, position, targetBuffer); | ||
| } | ||
|
|
||
| private static Checkpoint writeCheckpoint( |
There was a problem hiding this comment.
why is this relevant? I rather not touch files that aren't relevant for the PR (feel free to push such a change as removal of dead code without a PR)
| globalCheckpoint.set(randomIntBetween( | ||
| Math.toIntExact(engine.seqNoService().getGlobalCheckpoint()), | ||
| Math.toIntExact(engine.seqNoService().getLocalCheckpoint()))); | ||
| engine = new InternalEngine(config(indexSettings, store, createTempDir(), NoMergePolicy.INSTANCE, null), seqNoServiceSupplier) { |
There was a problem hiding this comment.
why did we lose the try with resources clause?
There was a problem hiding this comment.
We close the engine in tearDown but I put the try-clause back.
There was a problem hiding this comment.
I see - we reuse the same engine variable. I think it's cleaner to have it contained.
| engine.getTranslog().sync(); | ||
| } | ||
| if (frequently()) { | ||
| final long lastSyncedGlobalCheckpoint = engine.getTranslog().getLastSyncedGlobalCheckpoint(); |
There was a problem hiding this comment.
strictly speaking I think we need to read this from disk after the flush - i.e., make sure that what's on disk is OK.
| @Override | ||
| protected void commitIndexWriter(IndexWriter writer, Translog translog, String syncId) throws IOException { | ||
| // The global checkpoint is advanced but not fsynced yet. | ||
| final long lagging = seqNoService().getLocalCheckpoint() - seqNoService().getGlobalCheckpoint(); |
There was a problem hiding this comment.
I don't follow this - why do we need to query the global checkpoint from the seqNoService? we alread have access to globalCheckpoint ?
I think what you mean here is something like this - no if no rarely:
// Advance the global checkpoint during the flush to create a lag between what's persisted in the translog (and is visible for CombinedDeletionPolicy) and what's in memory in the SequenceServices
globalCheckpoint.set(randomLongBetween(globalCheckpoint.get(), seqNoService().getLocalCheckpoint()));
wdyt?
|
@bleskes I have addressed your comments. Would you please take a look? Thank you. |
|
Thanks @bleskes. |
Today we use the in-memory global checkpoint from SequenceNumbersService to clean up unneeded commit points, however the latest global checkpoint may haven't fsynced to the disk yet. If the translog checkpoint fsync failed and we already use a higher global checkpoint to clean up commit points, then we may have removed a safe commit which we try to keep for recovery. This commit updates the deletion policy using lastSyncedGlobalCheckpoint from Translog rather the in memory global checkpoint. Relates elastic#27606
) Today we use the in-memory global checkpoint from SequenceNumbersService to clean up unneeded commit points, however the latest global checkpoint may haven't fsynced to the disk yet. If the translog checkpoint fsync failed and we already use a higher global checkpoint to clean up commit points, then we may have removed a safe commit which we try to keep for recovery. This commit updates the deletion policy using lastSyncedGlobalCheckpoint from Translog rather the in memory global checkpoint. This is a backport of #27826.
|
Backported in #27866 |
Today we use the in-memory global checkpoint from SequenceNumbersService
to clean up unneeded commit points, however the latest global checkpoint
may haven't fsynced to the disk yet. If the translog checkpoint fsync
failed and we already use a higher global checkpoint to clean up commit
points, then we may have removed a safe commit which we try to keep for
recovery.
This commit updates the deletion policy using lastSyncedGlobalCheckpoint
from Translog rather the in memory global checkpoint.
Relates #27606