Feature/replace avl digest with merging digest by not-napoleon · Pull Request #35182 · elastic/elasticsearch

not-napoleon · 2018-11-01T20:50:12Z

Relates to #28702

@colings86 had started this conversion in the above referenced PR. This builds on his work with unit tests for deserializing between the two t-digest implementations.

This is a work in progress because the testing methodology is ready for feedback, but specific test parameters still need to be tuned.

elasticmachine · 2018-11-01T20:50:14Z

Pinging @elastic/es-search-aggs

not-napoleon · 2018-11-20T16:24:08Z

The big question I set out to answer in this PR was what happens in a mixed
cluster case. Such a situation could result from an in-flight upgrade, or a
CCS query. We could see one node serialize an AVLTree based TDigest, and
another node deserialize the same bytes as a MergeTree based TDigest. In this
case, we want to know what the error is, compared to just using AVLTree. Note
that we never care to compare to a MergeTree, since the best case a user could
get without this change is the AVLTree error.

I have graphs below, but will give an overview first. Our worst drop in
accuracy was about 1.3 percentage points (all analysis is done in percentage
points, by subtracting calculated error percents). In a few cases, we
actually saw an improvement in accuracy, but overall the mixed cluster case is
still worse than before this change. The drop in accuracy in most cases
however is very small, and we don't expect customers to be in a mixed version
situation long term.

As the compression increases (NB: Higher compression values mean more
centroids, and thus more accuracy.), the gap in accuracy rapidly narrows. I
ran tests at compression values of 100, 1000, 5000, and 10000, but the graphs
below only show 100 and 1000; that's because the results for 5k and 10k are so
tightly clustered around 0 as to be invisible.

The worst mixed-implementation performance actually came with Gaussian
distributed data looking at the 99th percentile. I suspect this happens
because the algorithm is well optimized for this case, and thus the
transmission errors I'm measuring here tend to dominate. That said, I haven't
done any digging to prove out that suspicion. I'm happy to do so if we want
the extra confidence, but it didn't seem like a blocker to me.

I am concerned that this PR will add a "flaky" unit test. I've run this test
a lot locally (order of hundreds of thousands of iterations), and it does
occasionally fail with a high delta. I can keep cranking the tolerance
threshold up until I don't see that, but at some point I wonder if it's even
worth having a test. I'm open to suggestions here.

Graphs of test results (Note - when I say "99th Percentile", I mean that's the percentile I queried the TDigest for, not that this is the 99th percentile of the error):
Gaussian Distribution, 99th Percentile:

Random Distribution, 99th Percentile:

Gaussian Distribution, 90th Percentile:

Random Distribution, 90th Percentile:

colings86 · 2018-11-21T12:10:33Z

Thanks for running these benchmarks @not-napoleon. Based on the fact that the error introduced in the mixed cluster scenario is pretty low and that we expect mixed clusters to be a short-lived situation I would be happy to proceed with the switchover as is. I'd be interested to hear thoughts from @polyfractal and @jpountz on this though.

Also in terms of the "flaky" test. I think we can do one of two things (again I'm interested to see what you, @polyfractal and @jpountz think here):

Don't merge the test but view it as a benchmarking exercise we did to ensure this change would not introduce unacceptable errors in the transition period.
Run the test in a different way where it calculates the percentage difference from N requests and calculates e.g. the 90th percentile is within a tolerance instead of hoping a single request is within tolerance. The downside here might be the test time but since it's a unit test hopefully this would not make the test take too long with something like N=50.

not-napoleon · 2018-11-21T15:29:54Z

@colings86 both options for the test seem reasonable. I was running 10k iterations of the test in less than 120 seconds on my laptop, and a lot of that time is gradle overhead, so 50 shouldn't even be a blip in our test times.

polyfractal · 2018-11-21T15:50:37Z

Based on the fact that the error introduced in the mixed cluster scenario is pretty low and that we expect mixed clusters to be a short-lived situation I would be happy to proceed with the switchover as is.

Gernerally ++ to this. @not-napoleon and I were chatting last night and another point came up: since the error basically goes to zero when compression is cranked up, if a user is unhappy about the potential cross-version error they could temporarily increase compression until the mixed cluster is homogeneous again.

I did have a few questions about testing methodology after noodling over it a bit. I'm not sure if any of these are important, they are more half-baked thoughts than anything:

I wonder if maybe we should be testing that abs(modern_error) <= abs(legacy_error * 1.05), rather than abs(legacy_error - modern_error) <= 0.05 ? E.g. we probably don't care if the errors are actually similar, just that it's close-to or better.
I wonder if we should test the case where many legacy AVL get deserialized and merged? E.g. I'm thinking of the case where there are a dozen shards in the legacy AVL and a single new MergingTree. Does the ~1% error start compounding badly? I suspect it will continue to be a nicely behaved error but maybe not?

Also in terms of the "flaky" test. I think we can do one of two things (again I'm interested to see what you, @polyfractal and @jpountz think here):

I'm leaning towards option 1), or just cranking the tolerance up a bit and revisiting if it still flakes out on CI. Since this is a short-lived problem/test that will go away after everyone upgrades, it probably doesn't make a ton of sense to invest lots of time trying to keep the test stable.

not-napoleon · 2018-11-27T21:24:13Z

I'm inclined to not keep the tests around; I don't think they prove anything individually that we need for the long term.

…this code

polyfractal

👍

elasticsearchmachine · 2022-02-02T20:45:06Z

Hi @not-napoleon, I've created a changelog YAML for you.

elasticsearchmachine · 2022-07-27T14:59:48Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

martijnvg · 2023-04-19T11:21:49Z

Looks like continuing to adopt merging digest is worth our while from a performance perspective:

|                                                Min Throughput | percentile_cpu_usage_per_container_15_minutes |      0.25977     |     13.8869      |      13.6271      |  ops/s | +5245.83% |
|                                               Mean Throughput | percentile_cpu_usage_per_container_15_minutes |      0.260257    |     14.256       |      13.9958      |  ops/s | +5377.68% |
|                                             Median Throughput | percentile_cpu_usage_per_container_15_minutes |      0.260266    |     14.3062      |      14.0459      |  ops/s | +5396.76% |
|                                                Max Throughput | percentile_cpu_usage_per_container_15_minutes |      0.260605    |     14.3405      |      14.0799      |  ops/s | +5402.76% |
|                                       50th percentile latency | percentile_cpu_usage_per_container_15_minutes |   3818.22        |     68.1857      |   -3750.03        |     ms |   -98.21% |
|                                       90th percentile latency | percentile_cpu_usage_per_container_15_minutes |   3931.99        |     73.0426      |   -3858.95        |     ms |   -98.14% |
|                                       99th percentile latency | percentile_cpu_usage_per_container_15_minutes |   4047.25        |     89.1457      |   -3958.11        |     ms |   -97.80% |
|                                      100th percentile latency | percentile_cpu_usage_per_container_15_minutes |   4054.87        |     90.7218      |   -3964.15        |     ms |   -97.76% |
|                                  50th percentile service time | percentile_cpu_usage_per_container_15_minutes |   3818.22        |     68.1857      |   -3750.03        |     ms |   -98.21% |
|                                  90th percentile service time | percentile_cpu_usage_per_container_15_minutes |   3931.99        |     73.0426      |   -3858.95        |     ms |   -98.14% |
|                                  99th percentile service time | percentile_cpu_usage_per_container_15_minutes |   4047.25        |     89.1457      |   -3958.11        |     ms |   -97.80% |
|                                 100th percentile service time | percentile_cpu_usage_per_container_15_minutes |   4054.87        |     90.7218      |   -3964.15        |     ms |   -97.76% |
|                                                    error rate | percentile_cpu_usage_per_container_15_minutes |      0           |      0           |       0           |      % |     0.00% |

The two hour and 24 hour variant of this search showed a similar relative improvement.

This based on running this search from a new tsdb rally track:

{
      "aggs": {
        "0": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "{{info[0]}}",
            "time_zone": "Europe/Athens",
            "min_doc_count": 1
          },
          "aggs": {
            "1": {
              "percentiles": {
                "field": "kubernetes.container.cpu.usage.nanocores",
                "percents": [
                  95
                ]
              }
            }
          }
        }
      },
      "size": 0,
      "fields": [
        {
          "field": "@timestamp",
          "format": "date_time"
        },
        {
          "field": "event.ingested",
          "format": "date_time"
        },
        {
          "field": "process.cpu.start_time",
          "format": "date_time"
        },
        {
          "field": "system.process.cpu.start_time",
          "format": "date_time"
        }
      ],
      "script_fields": {},
      "stored_fields": [
        "*"
      ],
      "runtime_mappings": {},
      "_source": {
        "excludes": []
      },
      "query": {
        "bool": {
          "must": [],
          "filter": [
            {
              "match_phrase": {
                "data_stream.dataset": "kubernetes.container"
              }
            },
            {
              "range": {
                "@timestamp": {
                  "format": "strict_date_optional_time",
                  "gte": "{{info[1]}}",
                  "lte": "{{end_time}}"
                }
              }
            }
          ],
          "should": [],
          "must_not": []
        }
      }
    } 
}

The baseline is main branch and contender this PR.

not-napoleon · 2024-05-02T12:09:28Z

This is no longer relevant

not-napoleon added 2 commits October 25, 2018 12:23

Existing tests passing against new implementation

a34e1db

WIP - Testing Strategy

3cd7097

not-napoleon added >feature WIP :Analytics/Aggregations Aggregations labels Nov 1, 2018

not-napoleon added 2 commits November 2, 2018 15:24

Different distribution modes for testing

d018c82

Set tolerance so tests pass

6f2b2a5

colings86 mentioned this pull request Nov 9, 2018

Uses MergingDigest instead of AVLDigest in percentiles agg #28702

Closed

Up the population size

287d640

not-napoleon added v7.0.0 and removed WIP labels Nov 20, 2018

not-napoleon requested review from colings86 and polyfractal November 20, 2018 16:30

colings86 requested a review from jpountz November 21, 2018 12:04

not-napoleon added 7 commits November 30, 2018 10:56

remove exploratory testing - doesn't provide long term value to keep …

c645e34

…this code

This has likely been fixed for a while

d214897

Merge branch 'master' into feature/replace-AVLDigest-with-MergingDigest

92f642e

Merge branch 'master' into feature/replace-AVLDigest-with-MergingDigest

b0a578d

Merge branch 'master' into feature/replace-AVLDigest-with-MergingDigest

27b17e6

Merge branch 'master' into feature/replace-AVLDigest-with-MergingDigest

f2a245d

Merge branch 'master' into feature/replace-AVLDigest-with-MergingDigest

54f37b3

$polyfractal$

polyfractal approved these changes Jan 3, 2019

View reviewed changes

Merge branch 'master' into feature/replace-AVLDigest-with-MergingDigest

b7c4a84

jasontedor added the v8.0.0 label Feb 6, 2019

arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022

mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022

Update docs/changelog/35182.yaml

c099aa5

salvatore-campagna added v8.3.0 and removed v8.2.0 labels Mar 30, 2022

craigtaverner added v8.4.0 and removed v8.3.0 labels May 25, 2022

elasticsearchmachine changed the base branch from master to main July 22, 2022 23:14

mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022

csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022

kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022

rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023

not-napoleon removed the v8.8.0 label Apr 18, 2023

not-napoleon added 2 commits April 18, 2023 11:40

Merge branch 'main' into feature/replace-AVLDigest-with-MergingDigest

fbdf6c4

fix compile errors

32cfb9d

upgraded to t-digest 3.3

2c43e61

martijnvg mentioned this pull request May 3, 2023

Improve metric query performance #95776

Closed

7 tasks

kkrik-es mentioned this pull request May 15, 2023

Fork TDigest library #96086

Merged

not-napoleon closed this May 2, 2024

msfroh mentioned this pull request Jun 9, 2025

[AUTOCUT] Gradle Check Flaky Test Report for TDigestStateTests opensearch-project/OpenSearch#18476

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/replace avl digest with merging digest#35182

Feature/replace avl digest with merging digest#35182
not-napoleon wants to merge 20 commits intoelastic:mainfrom
not-napoleon:feature/replace-AVLDigest-with-MergingDigest

not-napoleon commented Nov 1, 2018

Uh oh!

elasticmachine commented Nov 1, 2018

Uh oh!

not-napoleon commented Nov 20, 2018

Uh oh!

colings86 commented Nov 21, 2018

Uh oh!

not-napoleon commented Nov 21, 2018

Uh oh!

polyfractal commented Nov 21, 2018

Uh oh!

not-napoleon commented Nov 27, 2018

Uh oh!

$@polyfractal$ polyfractal left a comment

Uh oh!

elasticsearchmachine commented Feb 2, 2022

Uh oh!

elasticsearchmachine commented Jul 27, 2022

Uh oh!

martijnvg commented Apr 19, 2023

Uh oh!

not-napoleon commented May 2, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

14 participants

Conversation

not-napoleon commented Nov 1, 2018

Uh oh!

elasticmachine commented Nov 1, 2018

Uh oh!

not-napoleon commented Nov 20, 2018

Uh oh!

colings86 commented Nov 21, 2018

Uh oh!

not-napoleon commented Nov 21, 2018

Uh oh!

polyfractal commented Nov 21, 2018

Uh oh!

not-napoleon commented Nov 27, 2018

Uh oh!

polyfractal left a comment

Choose a reason for hiding this comment

Uh oh!

elasticsearchmachine commented Feb 2, 2022

Uh oh!

elasticsearchmachine commented Jul 27, 2022

Uh oh!

martijnvg commented Apr 19, 2023

Uh oh!

not-napoleon commented May 2, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

14 participants

$@polyfractal$ polyfractal left a comment