[Uptime] Use scripted metric for snapshot calculation #58247

Merged

andrewvc merged 7 commits into elastic:7.6 from andrewvc:scripted-metric-count

Feb 24, 2020

Contributor

andrewvc commented Feb 21, 2020

Summary

Fixes #58079

This is an improved version of #58078

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. The feature originally worked, but regressed when we increased the default timespan in the query context to 5m. Without this patch, the counts you get are the maximum total number of monitors that were down over the past 5m, which is not very useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
This was checked for keyboard-only and screenreader accessibility
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser
This was checked for cross-browser compatibility, including a check against IE11

For maintainers

This was checked for breaking API changes and was labeled appropriately


          [Uptime] Use scripted metric for snapshot calculation

c5e4a50

andrewvc added bug [zube]: In Review Team:Uptime - DEPRECATED v7.6.1 labels

andrewvc requested a review from justinkambic

February 21, 2020 18:32

andrewvc self-assigned this

Contributor

elasticmachine commented Feb 21, 2020

Pinging @elastic/uptime (Team:uptime)

andrewvc added the release_note:fix label

andrewvc added 3 commits

February 21, 2020 13:29


          [Uptime] Use scripted metric for snapshot calculation

ae072a6


          Add more comments

388c5e6


          Remove unnecessary import

f7f7a95

justinkambic reviewed

Contributor

justinkambic left a comment

I had a few questions and suggestions for cleaning, naming, commenting, but the base code looks good to me. I also still need to finish a functional review.

...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts Outdated

+                        return state;
+                      `,
+                        reduce_script: `
+                        // Use a treemap since it's later traversable in sorted order

Contributor

justinkambic Feb 21, 2020

it's later traversable in sorted order

I'm not familiar with the TreeMap class. Am I understanding correctly that it is self-balancing? Meaning that as keys are inserted, it handles the sort based on the comparison function you provide to merge below?

I.e. if I have a map with keys 1, 4, 5 and I insert 3, then traverse the entrySet, it will iterate like 1 3 4 5?

If that's correct, it might be good to expand this comment a little, since we are writing Java in a TypeScript file; it's reasonable that someone viewing this code might not be able to understand it easily.

Contributor Author

andrewvc Feb 21, 2020

Exactly, it will maintain the keys in order. Merge doesn't have anything to do with the sorting; I've added a comment below that explains that. Merge just updates the value if we have a more recent check from the same location.

The order of the TreeMap uses the built-in compareTo implementation of Java's String class.
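
As an illustration of the ordering described above, here is a minimal Java sketch (not the PR's Painless script). The class and method names are hypothetical, and the value format (an ISO-8601 UTC timestamp followed by a one-character status, "u" or "d") is an assumption chosen so that plain string comparison orders values by time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TreeMapOrderSketch {
    // Returns the keys in the order a TreeMap iterates them
    // (natural String ordering, i.e. String.compareTo).
    static List<String> iterationOrder(TreeMap<String, String> map) {
        return new ArrayList<>(map.keySet());
    }

    // merge() only decides which VALUE survives for an existing key;
    // it plays no part in how keys are sorted.
    // Assumption: values are "<ISO-8601 timestamp><status char>", so the
    // lexicographically greater string is the more recent check.
    static void keepLatest(TreeMap<String, String> map, String key, String timeStatus) {
        map.merge(key, timeStatus, (a, b) -> a.compareTo(b) >= 0 ? a : b);
    }

    public static void main(String[] args) {
        TreeMap<String, String> latest = new TreeMap<>();
        keepLatest(latest, "1", "2020-02-21T18:00:00Zu");
        keepLatest(latest, "4", "2020-02-21T18:00:00Zu");
        keepLatest(latest, "5", "2020-02-21T18:00:00Zu");
        keepLatest(latest, "3", "2020-02-21T18:00:00Zu"); // inserted last, iterated second
        keepLatest(latest, "3", "2020-02-21T18:05:00Zd"); // newer check replaces the value
        System.out.println(iterationOrder(latest));
        System.out.println(latest.get("3"));
    }
}
```

Because the keys are strings, the order is lexicographic rather than numeric: a key "10" would iterate before "9".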

...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts

+                          // Parse the length delimited id/location strings described in the map section
+                          int colonIndex = idLoc.indexOf(":");
+                          int idEnd = Integer.parseInt(idLoc.substring(0, colonIndex), 16) + colonIndex + 1;

Contributor

justinkambic Feb 21, 2020

Is 16 the radix?

Contributor Author

andrewvc Feb 21, 2020

Exactly, since we hex encode the numbers for density
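
For readers following the parsing lines quoted above, here is a minimal Java sketch of a length-delimited id/location string. The decode half mirrors the three quoted lines; the encode half, and the class and method names, are assumptions inferred from the decoder rather than taken from the PR.

```java
public class IdLocCodecSketch {
    // Assumed encoding: "<hex length of id>:<id><location>".
    // The length prefix lets the id itself contain ':' characters.
    static String encode(String id, String loc) {
        return Integer.toHexString(id.length()) + ":" + id + loc;
    }

    // Mirrors the quoted snippet: read the hex length up to the first ':',
    // then split the remainder into id and location.
    static String[] decode(String idLoc) {
        int colonIndex = idLoc.indexOf(":");
        // 16 is the radix: the length prefix is hex encoded.
        int idEnd = Integer.parseInt(idLoc.substring(0, colonIndex), 16) + colonIndex + 1;
        String id = idLoc.substring(colonIndex + 1, idEnd);
        String loc = idLoc.substring(idEnd, idLoc.length());
        return new String[] { id, loc };
    }

    public static void main(String[] args) {
        String packed = encode("tcp:9200", "fra");
        String[] parts = decode(packed);
        System.out.println(packed + " -> id=" + parts[0] + ", loc=" + parts[1]);
    }
}
```

Only the first colon is treated as the delimiter, so colons inside the id do not confuse the decoder.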

...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts

+                          String loc = idLoc.substring(idEnd, idLoc.length());
+                          String status = timeStatus.substring(timeStatus.length() - 1);
+                          locTotals.compute(loc, (k,v) -> {

Contributor

justinkambic Feb 21, 2020

A comment heading this block would ~~be helpful~~ be useful to a javascript developer 😅.

My understanding is we are updating the value for key loc, and the output of the provided function determines the new value. If the value was null, we create a new HashMap, then we increment appropriate values based on the documents we iterate over.

Contributor Author

andrewvc Feb 21, 2020

Yes, that's correct. I'll add a comment
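
The compute pattern discussed in this thread can be sketched in plain Java as follows. This is an illustration under assumptions, not the PR's script: the class and method names are hypothetical, and the status codes "u" and "d" are placeholders for whatever single-character suffix the real timeStatus strings carry.

```java
import java.util.HashMap;
import java.util.Map;

public class LocTotalsSketch {
    // Adds one check result to the per-location totals.
    // timeStatus is assumed to end in a one-character status code.
    static void tally(Map<String, Map<String, Integer>> locTotals, String loc, String timeStatus) {
        String status = timeStatus.substring(timeStatus.length() - 1);
        // compute() passes the current value for loc (null if absent);
        // whatever the lambda returns becomes the new value for that key.
        locTotals.compute(loc, (k, v) -> {
            Map<String, Integer> counts = (v == null) ? new HashMap<String, Integer>() : v;
            counts.merge(status, 1, Integer::sum); // start at 1 or increment
            return counts;
        });
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> locTotals = new HashMap<>();
        tally(locTotals, "fra", "2020-02-21T18:00:00Zu");
        tally(locTotals, "fra", "2020-02-21T18:00:00Zd");
        tally(locTotals, "nyc", "2020-02-21T18:00:00Zu");
        System.out.println(locTotals);
    }
}
```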

...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts

-                    counts[leastCommonStatus] = await slowStatusCount(context, leastCommonStatus);
-                    counts[mostCommonStatus] = counts.total - counts[leastCommonStatus];
-                  }
+                  const counts = await statusCount(context);

Contributor

justinkambic Feb 21, 2020

Do you think it'd be better to name this function getStatusCount?

Contributor Author

andrewvc Feb 21, 2020

I'm not sure if get has any particular meaning at least in my head, unless there's something to juxtapose it against.

Contributor

justinkambic Feb 21, 2020

That's fair

...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts

-                };
-              };
-              const slowStatusCount = async (context: QueryContext, status: string): Promise<number> => {

Contributor

justinkambic Feb 21, 2020

So now rather than having a fast/slow count, we're able to just have one counter (slower, but still fast, and always accurate), right?

Contributor Author

andrewvc Feb 21, 2020

Exactly

andrewvc mentioned this pull request

[Uptime] Improve snapshot timespan handling #58078

Closed

7 tasks

andrewvc added 2 commits

February 21, 2020 15:35


          Add tests

8c558fb


          Incorporate PR feedback

afeaf31

Contributor Author

andrewvc commented Feb 24, 2020

@elasticmachine merge upstream


          Merge branch '7.6' into scripted-metric-count

e305bec

justinkambic approved these changes

Contributor

justinkambic left a comment

LGTM, WFG

Contributor

kibanamachine commented Feb 24, 2020

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: e305bec

History

💚 Build #28262 succeeded afeaf31
💚 Build #28250 succeeded f7f7a95
💔 Build #28220 failed 388c5e6
💔 Build #28202 failed c5e4a50

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

andrewvc merged commit c11e866 into elastic:7.6

andrewvc deleted the scripted-metric-count branch

February 24, 2020 17:45

andrewvc added a commit to andrewvc/kibana that referenced this pull request


          [Uptime] Use scripted metric for snapshot calculation (elastic#58247)

420f3e5

andrewvc mentioned this pull request

[Uptime] Use scripted metric for snapshot calculation (#58247) #58389

Merged

7 tasks

andrewvc added a commit that referenced this pull request


          [Uptime] Use scripted metric for snapshot calculation (#58247) (#58389)

5eefdbb

andrewvc mentioned this pull request

[7.x] [Uptime] Use scripted metric for snapshot calculation (#58247) (#58389) #58415

Merged

andrewvc added a commit to andrewvc/kibana that referenced this pull request


          [Uptime] Use scripted metric for snapshot calculation (elastic#58247) (…

80ad29a

…elastic#58389)

jloleysens added a commit to jloleysens/kibana that referenced this pull request


          Merge branch 'master' of github.com:elastic/kibana into console/featu…

00c9a87

…re/files-and-filetree

* 'master' of github.com:elastic/kibana: (174 commits)
  [SIEM] Fix unnecessary re-renders on the Overview page (elastic#56587)
  Don't mutate error message (elastic#58452)
  Fix service map popover transaction duration (elastic#58422)
  [ML] Adding filebeat config to file dataviz (elastic#58152)
  [Uptime] Improve refresh handling when generating test data (elastic#58285)
  [Logs / Metrics UI] Remove path prefix from ViewSourceConfigur… (elastic#58238)
  [ML] Functional tests - adjust classification model memory (elastic#58445)
  [ML] Use event.timezone instead of beat.timezone in file upload (elastic#58447)
  [Logs UI] Unskip and stabilitize log column configuration tests (elastic#58392)
  [Telemetry] Separate the license retrieval from the stats in the usage collectors (elastic#57332)
  hide welcome screen for cloud (elastic#58371)
  Move src/legacy/ui/public/notify/app_redirect to kibana_legacy (elastic#58127)
  [ML] Functional tests - stabilize typing during df analytics creation (elastic#58227)
  fix short url in spaces (elastic#58313)
  [SIEM] Upgrades cypress to version 4.0.2 (elastic#58400)
  [Index management] Move to new platform "plugins" folder (elastic#58109)
  [kbn/optimizer] disable parallelization in terser plugin (elastic#58396)
  [Uptime] Delete useless try...catch blocks (elastic#58263)
  [Uptime] Use scripted metric for snapshot calculation (elastic#58247) (elastic#58389)
  [APM] Stabilize agent configuration API (elastic#57767)
  ...

# Conflicts:
#	src/plugins/console/public/application/containers/editor/legacy/console_editor/editor.tsx

elasticmachine added a commit to dhurley14/kibana that referenced this pull request


          [Uptime] Use scripted metric for snapshot calculation (elastic#58247) (…

51dc910

…elastic#58389) (elastic#58415)

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Labels

bug release_note:fix Team:Uptime - DEPRECATED v7.6.1