Enable a long translog retention policy by default #25294

bleskes merged 29 commits into elastic:master from …
Conversation
```
…n_part2
# Conflicts:
#	core/src/main/java/org/elasticsearch/index/shard/IndexShard.java
```
```java
if (refreshTask.getInterval().equals(indexSettings.getRefreshInterval()) == false) {
    rescheduleRefreshTasks();
}
if (trimTranslogTask.getInterval().equals(indexSettings.getTranslogRetentionCheckInterval()) == false) {
```
Can you elaborate on why you went down this route instead of using IndexShard#afterWriteOperation(), and maybe also doing it in IndexShard#flush? I'd rather remove tasks than add them! It's just way easier to reason about things if they happen due to a user interaction from the outside.
> Can you elaborate why you go down this route instead of using IndexShard#afterWriteOperation()
This is not instead but on top of it. The problem is that if people have a 12h retention policy, we need to start cleaning stuff up 12 hours later, even when nothing else triggers it (assuming they index at full speed and roll over to a new index).
> I'd love to rather remove tasks than adding them?!
I'm totally with you. I spent some time thinking about ways to avoid the regular check, but everything I came up with ended up more complex. I considered scheduling things on inactivity, etc., but always came back to this being the clearest/easiest approach. I'll be more than happy to hear about alternatives.
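For illustration only, here is a minimal sketch of the recurring check being discussed, using a plain `ScheduledExecutorService` rather than Elasticsearch's actual ThreadPool scheduling: with a 12h retention policy, trimming must eventually run even when no writes arrive to trigger `afterWriteOperation()`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TranslogTrimScheduler {
    // Schedules a recurring retention check and returns true once it has fired
    // `times` times (false on timeout). The period here is tiny to keep the
    // demo fast; the PR discusses a background check roughly every 10 minutes.
    static boolean runTrimChecks(int times, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch fired = new CountDownLatch(times);
        scheduler.scheduleAtFixedRate(
            fired::countDown, // stand-in for translog.trimUnreferencedReaders()
            0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            return fired.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```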
```diff
-engine.getTranslog().rollGeneration();
+final Translog translog = engine.getTranslog();
+translog.rollGeneration();
+translog.trimUnreferencedReaders();
```
Hmm, you run `translog.trimUnreferencedReaders()` under the engine lock here, but you run it without a lock elsewhere, so I wonder when we have to run under the lock and whether we can assert on that?
That's a great observation. Strictly speaking, we don't need to coordinate through the write lock here. In the place you linked to, I was mostly interested in the error handling and in failing the engine if anything goes wrong (as we close the translog on error as well). I was worried about poisonous situations where we keep trying to get stuff from a closed translog. I will work on removing the read lock in the engine and fold this functionality (trim and roll) into the engine so we can be consistent. Agreed?
I think the read lock is healthy here. I wonder whether we should rather fold this part into the engine, not call it from the outside, and add a `boolean rollGeneration` parameter to the trim method?
I too find it weird that we have IndexShard#trimTranslog, which goes through the engine (with error handling) but ultimately amounts to trimming unreferenced readers, while this method does the same without going through the engine, instead poking all the way through to the translog.
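A hypothetical sketch of the suggestion above: fold rolling and trimming into a single engine-level method with a `boolean rollGeneration` parameter, so callers never reach through to the translog directly. The names and the stub `Translog` are illustrative, not the real Elasticsearch API.

```java
public class EngineTrimSketch {
    // Minimal stand-in for the translog.
    static class Translog {
        int generation = 1;
        int trims = 0;
        void rollGeneration() { generation++; }
        void trimUnreferencedReaders() { trims++; }
    }

    private final Translog translog = new Translog();
    boolean failed = false;

    // Single entry point: error handling (failing the engine) lives here,
    // whether or not the caller also wants a new generation.
    public void trimTranslog(boolean rollGeneration) {
        try {
            if (rollGeneration) {
                translog.rollGeneration();
            }
            translog.trimUnreferencedReaders();
        } catch (RuntimeException e) {
            failed = true; // stand-in for failEngine(...)
            throw e;
        }
    }

    Translog getTranslog() { return translog; }
}
```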
```java
}

private Stream<? extends BaseTranslogReader> readersAboveMinSeqNo(long minSeqNo) {
    assert readLock.isHeldByCurrentThread() || writeLock.isHeldByCurrentThread();
```
Please have one assert per check; it's easier to see what failed.
I'm confused: this is an `or` check. I will add a message.
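A tiny sketch of that resolution, with the lock state reduced to booleans (the real code checks the holders of Elasticsearch's ReleasableLock): the disjunction stays in one assert, but a message makes a failure self-explanatory.

```java
public class AssertMessageDemo {
    // One assert for the whole "read OR write lock held" condition,
    // with a message instead of splitting it into two asserts
    // (which would be wrong here, since either lock is sufficient).
    static boolean assertLockHeld(boolean readLockHeld, boolean writeLockHeld) {
        assert readLockHeld || writeLockHeld
            : "readersAboveMinSeqNo requires the read or write lock to be held by the current thread";
        return readLockHeld || writeLockHeld;
    }
}
```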
```java
    // we're shutdown potentially on some tragic event, don't delete anything
    return;
}
if (readers.isEmpty()) {
```
Maybe instead of adding this intermediate return we can invert the condition and only do the work if it's not empty?
Yeah, I can do that, and I thought about it. The reason I went with this is that to do something like shouldTrimTranslog, where I can add this shortcut, I would also have to go and ask the retention policy for the current trimming situation, add tests, etc. Since this method is not called on every write, but rather on every roll/flush/background check (every 10m), I felt this was simpler. Let me know what you think.
Methods should have a single return statement IMO; everything else is more complex in 99% of the cases.
```java
}

synchronized long getViewCount(long viewGen) {
    return translogRefCounts.getOrDefault(viewGen, Counter.newCounter(false)).get();
```
You are creating a new counter anyway, even if one is already there. I think we can save that object and use computeIfAbsent?
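A self-contained illustration of the allocation point, with `AtomicLong` standing in for Lucene's `Counter`: `getOrDefault` evaluates its default eagerly, so a throwaway counter is built on every call, while `computeIfAbsent` only allocates on a miss (though note it also inserts the new value into the map, which `getOrDefault` does not).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class ViewCounts {
    // AtomicLong stands in for Lucene's Counter here.
    private final Map<Long, AtomicLong> translogRefCounts = new HashMap<>();

    // Eager default: a new AtomicLong is allocated on every call,
    // even when the generation is already tracked.
    synchronized long getViewCountEager(long viewGen) {
        return translogRefCounts.getOrDefault(viewGen, new AtomicLong()).get();
    }

    // Lazy default: allocation happens only when the key is missing,
    // but the freshly created counter is inserted into the map.
    synchronized long getViewCountLazy(long viewGen) {
        return translogRefCounts.computeIfAbsent(viewGen, g -> new AtomicLong()).get();
    }

    synchronized void incRef(long viewGen) {
        translogRefCounts.computeIfAbsent(viewGen, g -> new AtomicLong()).incrementAndGet();
    }
}
```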
```java
if (isSequenceNumberBasedRecoveryPossible) {
    logger.trace("performing sequence numbers based recovery. starting at [{}]", request.startingSeqNo());
    startingSeqNo = request.startingSeqNo();
```
The changes in here are unrelated to the trimming code. I wonder if we can do it in two steps and add the integration in a follow-up? That way reviews are much simpler and more effective.
Yes, this became less important than it used to be after some iterations. That said, the change is mostly related to reporting, and to the fact that acquiring a snapshot now takes a seq# in its constructor (to avoid scanning a large translog). I can try to roll that part back, but I was afraid it would change the system too much (we would stream more ops from the translog). I'm not sure it's worth it, tbh.
+727 −336: I think we should try harder to make PRs more contained. Really!
@s1monw thanks. I responded to some of your comments. Will apply code changes tomorrow.
This reverts commit 286d4a3.
```java
@Override
public void trimTranslog() throws EngineException {
    try (ReleasableLock lock = readLock.acquire()) {
```
Would you mind renaming this to `ignored`? It keeps the IDE from complaining, and it makes it clear that the resource is intentionally unused throughout the scope of the try-with-resources block.
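A self-contained sketch of that convention, using a toy `Releasable` wrapper loosely modeled on Elasticsearch's ReleasableLock: the try-with-resources variable exists only for its acquire/release side effect, so calling it `ignored` documents the intent.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class IgnoredLockDemo {
    // Unchecked variant of AutoCloseable, loosely mirroring ReleasableLock,
    // so the lock can be released by try-with-resources without a checked exception.
    interface Releasable extends AutoCloseable {
        @Override
        void close();
    }

    static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    static int trimCalls = 0;

    static Releasable acquireRead() {
        rwLock.readLock().lock();
        return rwLock.readLock()::unlock;
    }

    // The resource is never referenced in the body; naming it `ignored`
    // makes that explicit and quiets IDE unused-variable warnings.
    static void trimTranslog() {
        try (Releasable ignored = acquireRead()) {
            trimCalls++; // work performed while the read lock is held
        }
    }
}
```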
```java
.filter(reader -> {
    final long maxSeqNo = reader.getCheckpoint().maxSeqNo;
    return maxSeqNo == SequenceNumbersService.UNASSIGNED_SEQ_NO ||
        maxSeqNo >= minSeqNo;
```
Nit: I think this fits on the preceding line without wrapping.
```java
private long translogSizeInBytes;
private int numberOfOperations;
private long uncommittedSizeInBytes;
private int uncommittedOperations;
```
I wonder, since these are summed in node stats, whether this should be a long (to reduce the chance of overflow)?
I agree, but I think we need to do this consistently, and I didn't want to touch the other numberOfOperations field (and the bwc code that entails) in the same PR. That's why I did it like this.
To be clear, you're saying that you'll address this in a follow-up?
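To make the overflow concern concrete, a toy example (with hypothetical per-shard counts, not real stats) of summing int counters across shards into an int versus a long:

```java
public class StatsOverflow {
    // Summing into an int silently wraps past Integer.MAX_VALUE.
    static int sumAsInt(int[] perShardOps) {
        int total = 0;
        for (int ops : perShardOps) {
            total += ops;
        }
        return total;
    }

    // Accumulating into a long keeps the true total for any realistic
    // number of shards, even when each field stays an int on the wire.
    static long sumAsLong(int[] perShardOps) {
        long total = 0;
        for (int ops : perShardOps) {
            total += ops;
        }
        return total;
    }
}
```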
@s1monw I took out all the scheduled trimming (we can decide what to do about it later) and did the engine changes. @jasontedor I addressed your feedback. Can you please look again?
Thx @jasontedor @s1monw for the iterations.
* master: (56 commits)
  - Initialize max unsafe auto ID timestamp on shrink
  - Enable a long translog retention policy by default (elastic#25294)
  - Remove `index.mapping.single_type=false` from core/tests (elastic#25331)
  - test: single type defaults to true since alpha1 and not alpha3
  - Get short path name for native controllers
  - Live primary-replica resync (no rollback) (elastic#24841)
  - Upgrade to lucene-7.0.0-snapshot-ad2cb77. (elastic#25349)
  - percolator: Deprecate `document_type` parameter.
  - [DOCS] Fixed typo.
  - [rest-api-spec/indices.refresh] Remove old params
  - Remove redundant and broken MD5 checksum from repository-s3 (elastic#25270)
  - Initialize sequence numbers on a shrunken index
  - Port most snapshot/restore static bwc tests to qa:full-cluster-restart (elastic#25296)
  - Javadoc: ThreadPool doesn't reject while shutdown (elastic#23678)
  - test: verify `size_to_upgrade_in_bytes` in assertBusy(...)
  - Docs: Removed duplicated line in mapping docs
  - Add backward compatibility indices for 5.4.2
  - Update MockTransportService to the age of Transport.Connection (elastic#25320)
  - Add version v5.4.2 after release
  - IndexMetaData: Add internal format index setting (elastic#25292)
  - ...
We currently check whether translog files can be trimmed whenever we create a new translog generation or close a view. However #25294 added a long translog retention period (12h, max 512MB by default), which means translog files should potentially be cleaned up long after there isn't any indexing activity to trigger flushes/the creation of new translog files. We therefore need a scheduled background check to clean up those files once they are no longer needed. Relates to #10708
#25147 added the translog deletion policy but didn't enable it by default. This PR enables a default retention of 512MB (the same maximum size as the current translog) and an age of 12 hours (i.e., after 12 hours all translog files will be deleted). This increases the chance of an ops-based recovery, even if the primary flushed or the replica was offline for a few hours.

In order to see which parts of the translog are committed into Lucene, the translog stats are extended to include information about uncommitted operations.

Views now include all translog ops and guarantee, as before, that those will not go away. Snapshotting a view allows filtering out generations that are not relevant, based on a specific sequence number.
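The generation filtering described above can be sketched as follows. This is a toy model: `Reader` is a stand-in for a translog generation that just tracks its max seq#, and the `-2` sentinel assumes the value of `SequenceNumbersService.UNASSIGNED_SEQ_NO` in this codebase.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SeqNoFilterSketch {
    static final long UNASSIGNED_SEQ_NO = -2L; // assumed sentinel value

    // Toy stand-in for a translog generation.
    static class Reader {
        final long generation;
        final long maxSeqNo;
        Reader(long generation, long maxSeqNo) {
            this.generation = generation;
            this.maxSeqNo = maxSeqNo;
        }
    }

    // Keep only generations that can contain operations at or above minSeqNo;
    // a reader whose max seq# is unassigned must be conservatively retained.
    static List<Reader> readersAboveMinSeqNo(List<Reader> readers, long minSeqNo) {
        return readers.stream()
            .filter(r -> r.maxSeqNo == UNASSIGNED_SEQ_NO || r.maxSeqNo >= minSeqNo)
            .collect(Collectors.toList());
    }
}
```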
I still have to write some docs and add a migration note, but I think we can start reviewing.
Relates to #10708