Adds average document size to DocsStats#27117
Conversation
This change is required in order to support a size based check for the index rollover. The average document size is estimated by sampling only existing segments. We prefer using segments rather than StoreStats because StoreStats is not reliable if indexing or merging operations are in progress. Relates elastic#27004
|
@jasontedor, Should we also expose this stat to the REST layer (eg. |
jpountz
left a comment
There was a problem hiding this comment.
I left some minor comment but in general it LGTM.
| return this.deleted; | ||
| } | ||
|
|
||
| public long getAverageSizeInBytes() { |
| public void readFrom(StreamInput in) throws IOException { | ||
| count = in.readVLong(); | ||
| deleted = in.readVLong(); | ||
| if (in.getVersion().onOrAfter(Version.V_6_1_0)) { |
There was a problem hiding this comment.
you might want to make it Version.V_7_0_0 for now and change it back after this change is backported in order not to cause failures in the multi-version cluster qa tests
There was a problem hiding this comment.
Thanks for the hint. I temporarily made this for v7 only (f38e957)
| indexShard = newStartedShard(true); | ||
| int smallDocNum = randomIntBetween(5, 100); | ||
| for (int i = 0; i < smallDocNum; i++) { | ||
| indexDoc(indexShard, "test", "small-" + i); |
There was a problem hiding this comment.
I think we've been trying to use doc as a type name in new tests whenever possible. Can you rename?
| @@ -880,8 +880,15 @@ public FlushStats flushStats() { | |||
| } | |||
|
|
|||
| public DocsStats docStats() { | |||
There was a problem hiding this comment.
I think this should rather be something like this:
public DocsStats docStats() {
long numDocs = 0;
long numDeletedDocs = 0;
long sizeInByte = 0;
List<Segment> segments = segments(false);
for (Segment segment : segments) {
if (segment.search) {
numDocs += segment.getNumDocs();
numDeletedDocs += segment.getDeletedDocs();
sizeInByte += segment.getSizeInBytes();
}
}
return new DocsStats(numDocs, numDeletedDocs, sizeInByte);
}that way we maintain a consistent total and we can calculate the average at read time and aggregation of doc stats will be much simpler? I also think we should make sure the size in bytes is based on the currently used reader which is guaranteed by the Segment#search flag. WDYT?
There was a problem hiding this comment.
Yes, this makes the DocsStats simpler and the average value more accurate. I have updated this in 662f062
dakrone
left a comment
There was a problem hiding this comment.
I left two really minor comments, this LGTM regardless of what you choose :)
|
|
||
| public void add(DocsStats docsStats) { | ||
| if (docsStats == null) { | ||
| public void add(DocsStats that) { |
There was a problem hiding this comment.
Personally that is a bit too close to this for typo avoidance, so I'd prefer other, but it's so minor that it's totally up to you
There was a problem hiding this comment.
I also prefer other over that (addressed in 6cca080) but I used that in order to have the same indention for this expression (removed).
- long totalBytes = this.averageSizeInBytes * (this.count + this.deleted)
- + that.averageSizeInBytes * (that.count + that.deleted);| count = in.readVLong(); | ||
| deleted = in.readVLong(); | ||
| if (in.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) { | ||
| totalSizeInBytes = in.readVLong(); |
There was a problem hiding this comment.
Should we set totalSizeInBytes to -1 to indicate that it cannot be read and not that there are 0 bytes in use for 6.x nodes?
| public void readFrom(StreamInput in) throws IOException { | ||
| count = in.readVLong(); | ||
| deleted = in.readVLong(); | ||
| if (in.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) { |
There was a problem hiding this comment.
Also, this should be V_6_1_0 since you will be backporting this to the 6.x branch
There was a problem hiding this comment.
@jpountz recommended to make this for v7, then change for the backport later.
There was a problem hiding this comment.
This is correct, it should be 7.0.0. Then when you backport set it to 6.1.0 in the 6.x branch and make sure that the BWC tests in master against 6.x pass (you might have to skip some of them). Then push a commit to master flipping the version to 6.1.0 and removing the skips.
There was a problem hiding this comment.
Ahh okay, I hadn't realized we were doing it the reverse way now :)
There was a problem hiding this comment.
I am not sure what you mean by the reverse way, these are part of the steps to have green CI on all branches every step of the way.
| public void writeTo(StreamOutput out) throws IOException { | ||
| out.writeVLong(count); | ||
| out.writeVLong(deleted); | ||
| if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) { |
This change is required in order to support a size based check for the index rollover. The index size is estimated by sampling the existing segments only. We prefer using segments to StoreStats because StoreStats is not reliable if indexing or merging operations are in progress. Relates #27004
* master: (63 commits) [Docs] Fix note in bucket_selector [Docs] Fix indentation of examples (elastic#27168) [Docs] Clarify `span_not` query behavior for non-overlapping matches (elastic#27150) [Docs] Remove first person "I" from getting started (elastic#27155) [Docs] Correct link target for datatype murmur3 (elastic#27143) Fix division by zero in phrase suggester that causes assertion to fail Enable Docstats with totalSizeInBytes for 6.1.0 Adds average document size to DocsStats (elastic#27117) Upgrade Painless from ANTLR 4.5.1-1 to ANTLR 4.5.3. (elastic#27153) Exists template needs a template name (elastic#25988) [Tests] Fix occasional test failure due to two random values being the same Fix beidermorse phonetic token filter for unspecified `languageset` (elastic#27112) Fix max score tracking with field collapsing (elastic#27122) [Doc] Add Ingest CSV Processor Plugin to plugin as a community plugin (elastic#27105) Removed the beta tag from cross-cluster search fixed typo in ConstructingObjectParse (elastic#27129) Allow for the Painless Definition to have multiple instances (elastic#27096) Apply missing request options to the expand phase (elastic#27118) Only pull SegmentReader once in getSegmentInfo (elastic#27121) Fix BWC for discovery stats ...
This change is required in order to support a size based check for the
index rollover.
The average document size is estimated by sampling only existing
segments. We prefer using segments rather than StoreStats because
StoreStats is not reliable if indexing or merging operations are in
progress.
Relates #27004