Skip to content

Use lastSyncedGlobalCheckpoint in deletion policy#27826

Merged
dnhatn merged 6 commits intoelastic:masterfrom
dnhatn:translog-gcp
Dec 16, 2017
Merged

Use lastSyncedGlobalCheckpoint in deletion policy#27826
dnhatn merged 6 commits intoelastic:masterfrom
dnhatn:translog-gcp

Conversation

@dnhatn
Copy link
Copy Markdown
Member

@dnhatn dnhatn commented Dec 14, 2017

Today we use the in-memory global checkpoint from SequenceNumbersService
to clean up unneeded commit points, however the latest global checkpoint
may haven't fsynced to the disk yet. If the translog checkpoint fsync
failed and we already use a higher global checkpoint to clean up commit
points, then we may have removed a safe commit which we try to keep for
recovery.

This commit updates the deletion policy using lastSyncedGlobalCheckpoint
from Translog rather the in memory global checkpoint.

Relates #27606

Today we use the in-memory global checkpoint from SequenceNumbersService
to clean up unneeded commit points, however the latest global checkpoint
may haven't fsynced to the disk yet. If the translog checkpoint fsync
failed and we already use a higher global checkpoint to clean up commit
points, then we may have removed a safe commit which we try to keep for
recovery.

This commit updates the deletion policy using lastSyncedGlobalCheckpoint
from Translog rather the in memory global checkpoint.
Copy link
Copy Markdown
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx Nhat. I left some comments

Channels.readFromFileChannelWithEofException(channel, position, targetBuffer);
}

private static Checkpoint writeCheckpoint(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this relevant? I rather not touch files that aren't relevant for the PR (feel free to push such a change as removal of dead code without a PR)

globalCheckpoint.set(randomIntBetween(
Math.toIntExact(engine.seqNoService().getGlobalCheckpoint()),
Math.toIntExact(engine.seqNoService().getLocalCheckpoint())));
engine = new InternalEngine(config(indexSettings, store, createTempDir(), NoMergePolicy.INSTANCE, null), seqNoServiceSupplier) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we lose the try with resources clause?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We close the engine in tearDown but I put the try-clause back.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - we reuse the same engine variable. I think it's cleaner to have it contained.

engine.getTranslog().sync();
}
if (frequently()) {
final long lastSyncedGlobalCheckpoint = engine.getTranslog().getLastSyncedGlobalCheckpoint();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strictly speaking I think we need to read this from disk after the flush - i.e., make sure that what's on disk is OK.

@Override
protected void commitIndexWriter(IndexWriter writer, Translog translog, String syncId) throws IOException {
// The global checkpoint is advanced but not fsynced yet.
final long lagging = seqNoService().getLocalCheckpoint() - seqNoService().getGlobalCheckpoint();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this - why do we need to query the global checkpoint from the seqNoService? we alread have access to globalCheckpoint ?

I think what you mean here is something like this - no if no rarely:

 // Advance the global checkpoint during the flush to create a lag between what's persisted in the translog (and is visible for CombinedDeletionPolicy) and what's in memory in the SequenceServices
globalCheckpoint.set(randomLongBetween(globalCheckpoint.get(), seqNoService().getLocalCheckpoint()));

wdyt?

@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Dec 15, 2017

@bleskes I have addressed your comments. Would you please take a look? Thank you.

Copy link
Copy Markdown
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thx @dnhatn

@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Dec 16, 2017

Thanks @bleskes.

@dnhatn dnhatn merged commit 4f62b51 into elastic:master Dec 16, 2017
@dnhatn dnhatn deleted the translog-gcp branch December 16, 2017 16:03
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Dec 18, 2017
Today we use the in-memory global checkpoint from SequenceNumbersService
to clean up unneeded commit points, however the latest global checkpoint
may haven't fsynced to the disk yet. If the translog checkpoint fsync
failed and we already use a higher global checkpoint to clean up commit
points, then we may have removed a safe commit which we try to keep for
recovery.

This commit updates the deletion policy using lastSyncedGlobalCheckpoint
from Translog rather the in memory global checkpoint.

Relates elastic#27606
dnhatn added a commit that referenced this pull request Dec 19, 2017
)

Today we use the in-memory global checkpoint from SequenceNumbersService
to clean up unneeded commit points, however the latest global checkpoint
may haven't fsynced to the disk yet. If the translog checkpoint fsync
failed and we already use a higher global checkpoint to clean up commit
points, then we may have removed a safe commit which we try to keep for
recovery.

This commit updates the deletion policy using lastSyncedGlobalCheckpoint
from Translog rather the in memory global checkpoint.

This is a backport of #27826.
@dnhatn
Copy link
Copy Markdown
Member Author

dnhatn commented Dec 19, 2017

Backported in #27866

@clintongormley clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.2.0 v7.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants