Conversation
As the translog evolves towards a full operations log as part of the sequence numbers push, there is a need for the translog to be able to represent operations for which a sequence number was assigned, but the operation did not mutate the index. Examples of how this can arise are operations that fail after the sequence number is assigned, and gaps in this history that arise when an operation is assigned a sequence number but the operation never completed (e.g., a node crash). It is important that these operations appear in the history so that they can be replicated and replayed during recovery as otherwise the history will be incomplete and local checkpoints will not be able to advance. This commit introduces a no-op to the translog to set the stage for these efforts.
f19e12e to
36b9edb
Compare
bleskes
left a comment
There was a problem hiding this comment.
Looks great. Left some very minor comments.
| writeByte((byte) i); | ||
| } | ||
|
|
||
| public static int lengthVLong(long i) { |
There was a problem hiding this comment.
I removed this method.
| @@ -910,7 +927,7 @@ public abstract static class Operation { | |||
|
|
|||
| /** type of operation (index, delete), subclasses use static types */ | |||
There was a problem hiding this comment.
nit: comment needs updating
There was a problem hiding this comment.
I updated this comment.
|
|
||
| @Override | ||
| public int estimatedSizeInBytes() { | ||
| return 2 * reason.length() + StreamOutput.lengthVLong(seqNo()) + StreamOutput.lengthVLong(primaryTerm()); |
There was a problem hiding this comment.
do we really need to be so correct here? I mean, it's only used for the indexing memory buffer and all other places ignore the seq nos. I think we dont need an extra method?
PS. This made me think of something else - we can't use vlongs for seq# - they can be negative when coming from an old primary..
There was a problem hiding this comment.
I changed the serialization to be (read|write)Long and updated the size estimate to just add twice the number of bytes per long.
| noOpResult.setTook(System.nanoTime() - noOp.startTime()); | ||
| noOpResult.freeze(); | ||
| return noOpResult; | ||
| } finally { |
There was a problem hiding this comment.
seq no must be assigned here, no?
There was a problem hiding this comment.
Yes, but I prefer the symmetry with innerIndex and innerDelete.
|
|
||
| out.writeByte(versionType.getValue()); | ||
| out.writeLong(autoGeneratedIdTimestamp); | ||
| out.writeVLong(seqNo); |
There was a problem hiding this comment.
not related to this change but a (big) bug - this can unassigned. I wonder how our BWC didn't catch it... maybe we don't run the nodes with assertion enabled?
There was a problem hiding this comment.
We do run with assertions enabled; isn't it because we guard in the serialization:
if (format >= FORMAT_SEQ_NO) {
seqNo = in.readVLong();
primaryTerm = in.readVLong();
}
Either way, I think it's simpler to just use (read|write)Long everywhere.
There was a problem hiding this comment.
I'm +1 on using write/read long. I still don't get the bwc aspect though - we shouldn't be able to write a negative long with writeVLong. A request coming in from a primary on an old node should have it's seq# assigned to -2L
There was a problem hiding this comment.
I discussed this with @bleskes via another channel. Our current theory is that assertions are not enabled for standalone nodes running in our integration tests. I will investigate and address accordingly.
There was a problem hiding this comment.
Assertions are indeed not enabled on the standalone integration tests. I think they should be, they would have caught at least one bug in core (that I found after I enabled assertions there), and the issue here before changing the serialization to (read|write)Long. I will open a PR soon.
|
|
||
| @Override | ||
| public long estimateSize() { | ||
| return 2 * reason.length() + StreamOutput.lengthVLong(seqNo) + StreamOutput.lengthVLong(primaryTerm); |
There was a problem hiding this comment.
same issue here - we can just use size of long here - reason.length() is also wrong here - we write as utf8.
There was a problem hiding this comment.
I changed the serialization to just use (read|write)Long, and the size estimate accordingly.
| } | ||
|
|
||
| BytesStreamOutput out = new BytesStreamOutput(); | ||
| total.writeTo(out); |
There was a problem hiding this comment.
is this total test replaced by something else?
There was a problem hiding this comment.
I added a new test for the totals.
* master: Simplify Unicast Zen Ping (elastic#22277) Replace IndicesQueriesRegistry (elastic#22289) Fixed document mistake and fit for 5.1.1 API [TEST] improve error message in ESTestCase#assertWarnings [TEST] remove deleted test classes from checkstyle suppressions [TEST] make ESSingleNodeTestCase tests repeatable (elastic#22283) Link for setting page in elasticsearch.yml is outdated Factor out sort values from InternalSearchHit (elastic#22080) Add ID for percolate query to Java API docs x_refresh.yaml tests should use unique index names and doc ids to ease debugging IndicesStoreIntegrationIT should not use start recovery sending as an indication that the recovery started Added base class for testing aggregators and some initial tests for `terms`, `top_hits` and `min` aggregations. Add link to foreach processor to ingest-attachment.asciidoc
|
Thanks @bleskes, I pushed a commit responding to your feedback. |
bleskes
left a comment
There was a problem hiding this comment.
Thx @jasontedor . That BWC aspect (not-related to this PR) still puzzles me though - how could we have written -2 using writeVlong?
| * @return failure if the failure is a document specific failure (e.g. analysis chain failure) | ||
| * or throws Exception if the failure caused the engine to fail (e.g. out of disk, lucene tragic event) | ||
| * | ||
| * <p> |
There was a problem hiding this comment.
I guess the IDE did that? I will remove.
|
Thanks @bleskes. |
As the translog evolves towards a full operations log as part of the
sequence numbers push, there is a need for the translog to be able to
represent operations for which a sequence number was assigned, but the
operation did not mutate the index. Examples of how this can arise are
operations that fail after the sequence number is assigned, and gaps in
this history that arise when an operation is assigned a sequence number
but the operation never completed (e.g., a node crash). It is important
that these operations appear in the history so that they can be
replicated and replayed during recovery as otherwise the history will be
incomplete and local checkpoints will not be able to advance. This
commit introduces a no-op to the translog to set the stage for these
efforts.
Relates #10708