Introducing a translog deletion policy#24950
Conversation
…_policy # Conflicts: # core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java # core/src/main/java/org/elasticsearch/index/shard/IndexShard.java # core/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
s1monw
left a comment
There was a problem hiding this comment.
I left some minors really this looks awesome!
| this.uidField = engineConfig.getIndexSettings().isSingleType() ? IdFieldMapper.NAME : UidFieldMapper.NAME; | ||
| this.versionMap = new LiveVersionMap(); | ||
| final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(); | ||
| this.deletionPolicy = new CombinedDeletionPolicy( |
There was a problem hiding this comment.
nit: does this need to break into 3 lines or is this maybe a leftover?
There was a problem hiding this comment.
sadly it doesn't fit in one line, I'll make it like this - it seems you prefer it:
this.deletionPolicy = new CombinedDeletionPolicy(
new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()), translogDeletionPolicy, openMode);
| return operationCounter; | ||
| } | ||
|
|
||
| public long lastSyncedGlobalCheckpoint() { |
There was a problem hiding this comment.
actually, this can be completely removed - it's a leftover from the "somewhat into the future" POC. good catch.
|
|
||
| /** Records how many views are held against each | ||
| * translog generation */ | ||
| protected final Map<Long,Integer> translogRefCounts = new HashMap<>(); |
There was a problem hiding this comment.
maybe this is a good place for LongIntMap?
There was a problem hiding this comment.
Another option would be Map<Long,Counter> then you can do this:
translogRefCounts.computeIfAbsent(translogGen, Counter.newCounter(false)).addAndGet(1);
//....
value = translogRefCounts.computeIfAbsent(translogGen, Counter.newCounter(false)).addAndGet(-1);It would make things easier to read IMO?
There was a problem hiding this comment.
org.apache.lucene.util.Counter that is
There was a problem hiding this comment.
I looked at LongIntMap but decided not to have a 3rd party dependency for this low performance, rarely used map. I like the Counter class usage. It simplifies things. Thanks!
| import java.util.List; | ||
| import java.util.Map; | ||
|
|
||
| public class TranslogDeletionPolicy { |
There was a problem hiding this comment.
maybe we can make this final right away?
There was a problem hiding this comment.
sure. I thought I'd need to subclass in tests but it turned out not be necessary.
| ensureOpen(); | ||
| View view = new View(lastCommittedTranslogFileGeneration); | ||
| outstandingViews.add(view); | ||
| viewGenToClean = deletionPolicy.acquireTranslogGenForView(); |
There was a problem hiding this comment.
wouldn't it be simpler if you remove the viewGenToClean and just do this return new View(deletionPolicy.acquireTranslogGenForView());
There was a problem hiding this comment.
It would be but I'm paranoid about an exception in the View constructor. This way it's clearly safe.
There was a problem hiding this comment.
there is no exception possibility here? I think this is overparanoia
There was a problem hiding this comment.
The View constructor does not even do anything, it just sets a field?
There was a problem hiding this comment.
I really do not like the use of setting viewGenToClean to -1 to indicate not to release the view. Is there a reason that you do not make viewGenToClean final and local to the try block and set a boolean flag to indicate success or not?
| * returns the minimum translog generation that is still required by the system. Any generation below | ||
| * the returned value may be safely deleted | ||
| */ | ||
| public synchronized long minTranslogGenRequired(List<TranslogReader> readers, TranslogWriter currentWriter) { |
There was a problem hiding this comment.
yeah, I missed your comment from the last time. I will remove the params. They are a leftover from the POC (to show how we would do size based deletion).
| } | ||
|
|
||
| private void setLastCommittedTranslogGeneration(List<? extends IndexCommit> commits) throws IOException { | ||
| final IndexCommit indexCommit = commits.get(commits.size() - 1); |
There was a problem hiding this comment.
can you leave a comment why we only use the last one? It would help others to reason about this code
| * An {@link IndexDeletionPolicy} that coordinates between Lucene's commits and the retention of translog generation files, | ||
| * making sure that all translog files that are need to recover from the lucene commit are not deleted. | ||
| */ | ||
| public class CombinedDeletionPolicy extends IndexDeletionPolicy { |
There was a problem hiding this comment.
make this final and maybe pkg private?
There was a problem hiding this comment.
yep. And it exposed a leftover wrong Java Doc reference.
|
Thx @s1monw . I addressed all your feedback. I will wait for @jasontedor to have a look as well. |
…_policy # Conflicts: # core/src/test/java/org/elasticsearch/index/translog/TranslogTests.java
|
Thx @jasontedor . I addressed your comments. Can you take another look? |
jasontedor
left a comment
There was a problem hiding this comment.
Some lingering nits that do not require another look from me but otherwise LGTM. Thanks @bleskes.
| * written, the current translogs file generation and it's fsynced offset in bytes. | ||
| * Each Translog has only one translog file open for writes at any time referenced by a translog generation ID. This ID is written to a | ||
| * <tt>translog.ckp</tt> file that is designed to fit in a single disk block such that a write of the file is atomic. The checkpoint file | ||
| * is written on each fsync operation of the translog and records the number of operations written, the current translogs file generation |
| * Each Translog has only one translog file open for writes at any time referenced by a translog generation ID. This ID is written to a | ||
| * <tt>translog.ckp</tt> file that is designed to fit in a single disk block such that a write of the file is atomic. The checkpoint file | ||
| * is written on each fsync operation of the translog and records the number of operations written, the current translogs file generation | ||
| * , it's fsynced offset in bytes and other important statistics. |
There was a problem hiding this comment.
Remove comma to start line, place at end of previous line.
|
Thx @s1monw , @jasontedor |
When we open a translog, we rely on the `translog.ckp` file to tell us what the maximum generation file should be and on the information stored in the last lucene commit to know the first file we need to recover. This requires coordination and is currently subject to a race condition: if a node dies after a lucene commit is made but before we remove the translog generations that were unneeded by it, the next we open the translog we will ignore those files and never delete them (I have added tests for this). This PR changes the approach to have the translog store both of those numbers in the `translog.ckp`. This means it's more self contained and easier to control. This change also decouples the translog recovery logic from the specific commit we're opening. This prepares the ground to fully utilize the deletion policy introduce elastic#24950 and store more translog data that's needed for Lucene, keep multiple lucene commits around, and be free to recover from any of them.
When we open a translog, we rely on the `translog.ckp` file to tell us what the maximum generation file should be and on the information stored in the last lucene commit to know the first file we need to recover. This requires coordination and is currently subject to a race condition: if a node dies after a lucene commit is made but before we remove the translog generations that were unneeded by it, the next time we open the translog we will ignore those files and never delete them (I have added tests for this). This PR changes the approach to have the translog store both of those numbers in the `translog.ckp`. This means it's more self contained and easier to control. This change also decouples the translog recovery logic from the specific commit we're opening. This prepares the ground to fully utilize the deletion policy introduced in #24950 and store more translog data that's needed for Lucene, keep multiple lucene commits around and be free to recover from any of them.
|
@ArielCoralogix previously we'd throw away the translog files immediately after flush. There was no hard coded time based interval. That said - I think it will be very rare this will cause much more files to be open than before. The translog still stays under 512MB as before. This changes doesn't mean we check every 12 hours, we check after each indexing request. The changes means that we keep at most 512MB for at most 12 hours. Effectively - an active translog will always be 512MB rathen then shrinking to 0 and growing again to 512MB. After 12hrs it will be cleaned away. I don't believe you have so many active shards within 12 hrs for this to seriously influence the number of open files. Or do you? |
|
Hey @bleskes, I'm Ariel's colleague. from running: sudo lsof -p |
|
Hey @bleskes adding to @farin99's comment. We currently have 150,000 open file descriptors on some of our servers (and slowly rising). This ticket has some more info: #29097 |
Currently, the decisions regarding which translog generation files to delete are hard coded in the interaction between the
InternalEngineand theTranslogclasses. This PR extracts it to a dedicated class calledTranslogDeletionPolicy, for two main reasons: