Adds average document size to DocsStats by dnhatn · Pull Request #27117 · elastic/elasticsearch

dnhatn · 2017-10-26T03:37:51Z

This change is required in order to support a size-based check for the
index rollover.

The average document size is estimated by sampling only existing
segments. We prefer using segments rather than StoreStats because
StoreStats is not reliable if indexing or merging operations are in
progress.

Relates #27004

dnhatn · 2017-10-26T03:38:50Z

@jasontedor, should we also expose this stat to the REST layer (e.g. _cat)?

jpountz

I left some minor comments, but in general it LGTM.

jpountz · 2017-10-26T09:58:44Z

core/src/main/java/org/elasticsearch/index/shard/DocsStats.java

        return this.deleted;
    }

+    public long getAverageSizeInBytes() {


Can you add javadocs?

jpountz · 2017-10-26T09:59:53Z

core/src/main/java/org/elasticsearch/index/shard/DocsStats.java

    public void readFrom(StreamInput in) throws IOException {
        count = in.readVLong();
        deleted = in.readVLong();
+        if (in.getVersion().onOrAfter(Version.V_6_1_0)) {


You might want to make it Version.V_7_0_0 for now and change it back after this change is backported, so that it does not cause failures in the multi-version cluster QA tests.

Thanks for the hint. I temporarily made this for v7 only (f38e957).

jpountz · 2017-10-26T10:03:46Z

core/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java

+            indexShard = newStartedShard(true);
+            int smallDocNum = randomIntBetween(5, 100);
+            for (int i = 0; i < smallDocNum; i++) {
+                indexDoc(indexShard, "test", "small-" + i);


I think we've been trying to use `doc` as a type name in new tests whenever possible. Can you rename?

s1monw

I left a suggestion.

s1monw · 2017-10-26T10:05:39Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

@@ -880,8 +880,15 @@ public FlushStats flushStats() {
    }

    public DocsStats docStats() {


I think this should rather be something like this:

    public DocsStats docStats() {
        long numDocs = 0;
        long numDeletedDocs = 0;
        long sizeInByte = 0;
        List<Segment> segments = segments(false);
        for (Segment segment : segments) {
            if (segment.search) {
                numDocs += segment.getNumDocs();
                numDeletedDocs += segment.getDeletedDocs();
                sizeInByte += segment.getSizeInBytes();
            }
        }
        return new DocsStats(numDocs, numDeletedDocs, sizeInByte);
    }

That way we maintain a consistent total, we can calculate the average at read time, and aggregating doc stats becomes much simpler. I also think we should make sure the size in bytes is based on the currently used reader, which is guaranteed by the Segment#search flag. WDYT?

Yes, this makes the DocsStats simpler and the average value more accurate. I have updated this in 662f062
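The idea agreed on in this thread (store totals, derive the average when it is read) can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `DocsStatsSketch` is a hypothetical stand-in for `DocsStats`, and dividing by live plus deleted documents is an assumption based on the earlier expression in this PR that multiplied the average by `count + deleted`.

```java
/** Hypothetical, simplified stand-in for DocsStats; not the actual Elasticsearch class. */
final class DocsStatsSketch {
    private long count;            // live documents
    private long deleted;          // deleted documents still present in segments
    private long totalSizeInBytes; // summed size of the searchable segments

    DocsStatsSketch(long count, long deleted, long totalSizeInBytes) {
        this.count = count;
        this.deleted = deleted;
        this.totalSizeInBytes = totalSizeInBytes;
    }

    /** Because only totals are stored, merging two stats objects is plain addition. */
    void add(DocsStatsSketch other) {
        if (other == null) {
            return;
        }
        this.count += other.count;
        this.deleted += other.deleted;
        this.totalSizeInBytes += other.totalSizeInBytes;
    }

    long getCount() {
        return count;
    }

    long getDeleted() {
        return deleted;
    }

    long getTotalSizeInBytes() {
        return totalSizeInBytes;
    }

    /** The average is derived at read time; an empty shard reports 0 instead of dividing by zero. */
    long getAverageSizeInBytes() {
        long totalDocs = count + deleted;
        return totalDocs == 0 ? 0 : totalSizeInBytes / totalDocs;
    }
}
```

Storing an average instead would force `add` to reconstruct each side's total by multiplying the average back by its document count, which compounds the rounding error of the integer division on every merge.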

dnhatn · 2017-10-26T15:10:11Z

@jpountz @simonw I have addressed your feedback. Could you please take another look? Thank you.

dakrone

I left two really minor comments; this LGTM regardless of what you choose :)

dakrone · 2017-10-26T16:32:42Z

core/src/main/java/org/elasticsearch/index/shard/DocsStats.java


-    public void add(DocsStats docsStats) {
-        if (docsStats == null) {
+    public void add(DocsStats that) {


Personally, `that` is a bit too close to `this` for typo avoidance, so I'd prefer `other`, but it's so minor that it's totally up to you.

I also prefer `other` over `that` (addressed in 6cca080). I had used `that` only so that the two lines of this expression (now removed) would line up:

-        long totalBytes = this.averageSizeInBytes * (this.count + this.deleted)
-            + that.averageSizeInBytes * (that.count + that.deleted);

dakrone · 2017-10-26T17:27:30Z

core/src/main/java/org/elasticsearch/index/shard/DocsStats.java

        count = in.readVLong();
        deleted = in.readVLong();
+        if (in.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {
+            totalSizeInBytes = in.readVLong();


Should we set totalSizeInBytes to -1 when reading from 6.x nodes, to indicate that the value could not be read rather than that 0 bytes are in use?

Done 96e9be7
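The version gate and the -1 sentinel discussed in this thread can be sketched outside Elasticsearch as follows. This is an illustration under stated assumptions: it uses plain `java.io` data streams and a made-up integer wire version in place of Elasticsearch's `StreamInput`, `StreamOutput`, and `Version`, and the names `VersionedDocsStats`, `VERSION_WITH_SIZE`, `UNKNOWN_SIZE`, and `roundTrip` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of version-gated serialization; not the Elasticsearch wire format. */
final class VersionedDocsStats {
    static final int VERSION_WITH_SIZE = 2; // assumed first wire version that carries the size
    static final long UNKNOWN_SIZE = -1L;   // sentinel: the peer did not send a size

    final long count;
    final long deleted;
    final long totalSizeInBytes;

    VersionedDocsStats(long count, long deleted, long totalSizeInBytes) {
        this.count = count;
        this.deleted = deleted;
        this.totalSizeInBytes = totalSizeInBytes;
    }

    /** Writes the size only if the receiving peer is new enough to expect it. */
    void writeTo(DataOutputStream out, int peerVersion) throws IOException {
        out.writeLong(count);
        out.writeLong(deleted);
        if (peerVersion >= VERSION_WITH_SIZE) {
            out.writeLong(totalSizeInBytes);
        }
    }

    /** Reads the size if the sender wrote it; otherwise records "unknown", not 0. */
    static VersionedDocsStats readFrom(DataInputStream in, int peerVersion) throws IOException {
        long count = in.readLong();
        long deleted = in.readLong();
        long size = peerVersion >= VERSION_WITH_SIZE ? in.readLong() : UNKNOWN_SIZE;
        return new VersionedDocsStats(count, deleted, size);
    }

    /** Usage helper: serializes and deserializes as if both sides spoke the given version. */
    static VersionedDocsStats roundTrip(VersionedDocsStats stats, int version) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                stats.writeTo(out, version);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readFrom(in, version);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The read and write sides must test the same version constant. If they disagree, one side skips a field the other wrote or reads one that was never sent, and everything after it in the stream is misparsed. Code that consumes the size also needs to check for the sentinel before aggregating or averaging.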

dakrone · 2017-10-26T17:33:24Z

core/src/main/java/org/elasticsearch/index/shard/DocsStats.java

    public void readFrom(StreamInput in) throws IOException {
        count = in.readVLong();
        deleted = in.readVLong();
+        if (in.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {


Also, this should be V_6_1_0 since you will be backporting this to the 6.x branch

@jpountz recommended to make this for v7, then change for the backport later.

This is correct, it should be 7.0.0. Then when you backport set it to 6.1.0 in the 6.x branch and make sure that the BWC tests in master against 6.x pass (you might have to skip some of them). Then push a commit to master flipping the version to 6.1.0 and removing the skips.

Ahh okay, I hadn't realized we were doing it the reverse way now :)

I am not sure what you mean by the reverse way. These are part of the steps to have green CI on all branches every step of the way.

dakrone · 2017-10-26T17:33:42Z

core/src/main/java/org/elasticsearch/index/shard/DocsStats.java

    public void writeTo(StreamOutput out) throws IOException {
        out.writeVLong(count);
        out.writeVLong(deleted);
+        if (out.getVersion().onOrAfter(Version.V_7_0_0_alpha1)) {


Same here for V_6_1_0

s1monw

LGTM

This change is required in order to support a size based check for the index rollover. The index size is estimated by sampling the existing segments only. We prefer using segments to StoreStats because StoreStats is not reliable if indexing or merging operations are in progress. Relates #27004

Relates #27117

* master: (63 commits)
  [Docs] Fix note in bucket_selector
  [Docs] Fix indentation of examples (elastic#27168)
  [Docs] Clarify `span_not` query behavior for non-overlapping matches (elastic#27150)
  [Docs] Remove first person "I" from getting started (elastic#27155)
  [Docs] Correct link target for datatype murmur3 (elastic#27143)
  Fix division by zero in phrase suggester that causes assertion to fail
  Enable Docstats with totalSizeInBytes for 6.1.0
  Adds average document size to DocsStats (elastic#27117)
  Upgrade Painless from ANTLR 4.5.1-1 to ANTLR 4.5.3. (elastic#27153)
  Exists template needs a template name (elastic#25988)
  [Tests] Fix occasional test failure due to two random values being the same
  Fix beidermorse phonetic token filter for unspecified `languageset` (elastic#27112)
  Fix max score tracking with field collapsing (elastic#27122)
  [Doc] Add Ingest CSV Processor Plugin to plugin as a community plugin (elastic#27105)
  Removed the beta tag from cross-cluster search
  fixed typo in ConstructingObjectParse (elastic#27129)
  Allow for the Painless Definition to have multiple instances (elastic#27096)
  Apply missing request options to the expand phase (elastic#27118)
  Only pull SegmentReader once in getSegmentInfo (elastic#27121)
  Fix BWC for discovery stats
  ...

dnhatn requested review from dakrone and jasontedor October 26, 2017 03:39

dnhatn mentioned this pull request Oct 26, 2017

Add size-based condition to the index rollover API #27115

Closed

dnhatn added v6.1.0 v7.0.0 :Core/Infra/Stats Statistics tracking and retrieval APIs >enhancement labels Oct 26, 2017

trim a long line

a381d98

dnhatn changed the title ~~Adds document average size to DocsStats~~ Adds average document size to DocsStats Oct 26, 2017

dnhatn requested a review from s1monw October 26, 2017 04:05

jpountz approved these changes Oct 26, 2017

s1monw suggested changes Oct 26, 2017

dnhatn added 2 commits October 26, 2017 10:24

make it for v7 for now

f38e957

tracks total bytes, not average bytes

662f062

dakrone approved these changes Oct 26, 2017

dakrone reviewed Oct 26, 2017

dnhatn added 3 commits October 26, 2017 13:40

use other instead that

6cca080

set size = -1 for earlier versions

96e9be7

Merge branch 'master' into avg-doc-size

ca3023b

s1monw approved these changes Oct 28, 2017

dnhatn merged commit 07d270b into elastic:master Oct 28, 2017

dnhatn added the backport pending label Oct 28, 2017

dnhatn deleted the avg-doc-size branch October 28, 2017 17:21

dnhatn added a commit that referenced this pull request Oct 28, 2017

Enable Docstats with totalSizeInBytes for 6.1.0

ba167f7

Relates #27117

dnhatn added a commit that referenced this pull request Oct 28, 2017

Enable Docstats with totalSizeInBytes for 6.1.0

d01ad93

Relates #27117

dnhatn removed the backport pending label Oct 28, 2017

joegallo mentioned this pull request Jul 13, 2023

Add docs.total_size_in_bytes to the Index Stats API #97670

Closed

6 participants