[ML] Improve the accuracy of model memory control#122
tveasey merged 7 commits into elastic:master from
Conversation
a function of elapsed time not buckets and assert on memory used vs target in limit test.
764246f to f927102
dimitris-athanasiou left a comment
Looks good! Left a few minor comments.
    CDataGatherer(bool isForPersistence, const CDataGatherer& other);

    ~CDataGatherer();
    CDataGatherer(const CDataGatherer&) = delete;
That's a cool language feature!
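For readers unfamiliar with the feature being praised here: `= delete` (C++11) makes the copy constructor ill-formed, so accidental copies fail at compile time rather than silently sharing or duplicating state. A minimal sketch, using a hypothetical `Widget` class rather than the real `CDataGatherer`:

```cpp
#include <type_traits>

// Hypothetical class illustrating the "= delete" idiom used above:
// ordinary copying is forbidden, but a deliberate, explicitly named
// copy-for-persistence constructor can still be offered.
class Widget {
public:
    Widget() = default;
    // Intentional copy path; the bool forces callers to be explicit.
    explicit Widget(bool isForPersistence, const Widget& /*other*/) {
        (void)isForPersistence;
    }
    Widget(const Widget&) = delete;            // accidental copies won't compile
    Widget& operator=(const Widget&) = delete; // accidental assignment won't either
};

static_assert(!std::is_copy_constructible<Widget>::value,
              "copying is disabled at compile time");
```

The compiler, not a runtime check, enforces the rule, which is why `Widget w2 = w1;` simply does not build.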
    result += core::CMemory::dynamicSize(window);
    // The 0.3 is a rule-of-thumb estimate of the worst case
    // compression ratio we achieve on the test state.
    result += 0.3 * core::CMemory::dynamicSize(window);
Don't we only compress on persist?
In fact, no. As of #100, we compress the raw bytes of some of this object's state that we actually hold in memory.
Ah, I recall you saying we'd do that but I missed the fact it's already in. Cool.
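The accounting in the diff above can be paraphrased as a small free function. This is a sketch, not the ml-cpp implementation: `estimateWindowMemory` stands in for the surrounding method and takes the window's dynamic size directly instead of calling `core::CMemory::dynamicSize(window)`; the 0.3 worst-case compression ratio is the rule-of-thumb from the comment.

```cpp
#include <cstddef>

// Sketch of the estimate above: the window's raw bytes are held in memory,
// and a compressed copy of the test state is held too, so we budget for the
// raw size plus an assumed worst-case 0.3 compression ratio on that state.
std::size_t estimateWindowMemory(std::size_t dynamicSizeOfWindow) {
    std::size_t result = dynamicSizeOfWindow;
    // Rule-of-thumb worst-case compression ratio for the compressed state.
    result += static_cast<std::size_t>(0.3 * static_cast<double>(dynamicSizeOfWindow));
    return result;
}
```

So a window whose dynamic size is 1000 bytes would be budgeted at 1300 bytes.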
lib/model/CAnomalyDetectorModel.cc
Outdated
    // if we were going to persist the data gatherer from within this class.
    // We don't, so that's OK, but the next issue is that another thread will be
-   // modifying the data gatherer m_DataGatherer points to whilst this object
+   // modifying the data gatherer m_DataGatherer points too whilst this object
This seems like a typo
This was a typo before: the to[o] in this context means "as well", which is "too" rather than "to".
It's still kind of a weird sentence but fair enough!
Wait a sec, I'm sorry you're actually completely right. I'd somehow (repeatedly) misread!
    std::size_t CEventRateModel::memoryUsage() const {
        return this->CIndividualModel::memoryUsage() +
               core::CMemory::dynamicSize(m_InterimBucketCorrector);
Where is the memory for the interim bucket corrector accounted for now?
It is getting accounted for in this->CEventRateModel::computeMemoryUsage(). This is all tied up with the model memory estimation process we use, i.e. measuring the computed memory usage periodically and then fitting a regression to those measurements. The extra memory used is effectively accounted for in that regression's parameters.
    // will be the overwhelmingly common source of additional memory
    // so the model memory should be accurate (on average) in this
    // time frame.
    double scale{1.0 - static_cast<double>(elapsedTime) /
It seems the scale is fixed for a given bucket span. Should we consider setting it on construction?
We could, given current usage. Although this is currently called at the bucket frequency, that doesn't have to be the case, and we'd lose this flexibility if it changed. Also, I think moving this away from the API that adjusts the margin would hide an important part of the implementation from the caller: if I changed how this function is called, it would be easy to overlook that the value computed in the constructor also needs to change. On balance I prefer to keep it as is for that reason. What do you think?
I agree. This gives us flexibility to change the way it works if needed in the future.
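The behaviour under discussion, a memory margin that decays linearly as a function of elapsed time rather than bucket count, can be sketched as follows. Names here are illustrative (`marginScale`, `decayPeriod`), not the PR's; only the `1.0 - elapsedTime / ...` expression mirrors the diff above.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: the margin scale starts at 1 and decays linearly to 0 over a fixed
// decay period. Because it depends on elapsed *time*, the result is the same
// whether the function is called once per bucket or at any other frequency.
double marginScale(std::uint64_t elapsedTime, std::uint64_t decayPeriod) {
    double scale = 1.0 - static_cast<double>(elapsedTime) /
                             static_cast<double>(decayPeriod);
    return std::max(scale, 0.0); // fully decayed once the period has passed
}
```

This is why computing the scale once at construction would bake in an assumption about call frequency only if the formula were per-call rather than per-elapsed-time; as written, the caller can invoke it whenever it likes.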
…31289) To avoid temporary failures, this also disables these tests until elastic/ml-cpp#122 is committed.
This makes a number of changes targeting our current memory control functionality. Specifically, these are:
- There are two references to the data gatherer: one held by the CAnomalyDetectorModel object and one held by a CAnomalyDetector object. However, the memory was only accounted for by CAnomalyDetectorModel. Since the reference count is two, we were effectively halving its accounted memory. I've changed the CResourceMonitor to work in terms of CAnomalyDetector objects and now account for both references to the data gatherer. This incidentally also means we account for the static size of CAnomalyDetector, which was also being lost by the resource monitor. The impact can be large, especially for population models; for example, in CAnomalyJobLimitTest::testModelledEntityCountForFixedMemoryLimit we model 45% fewer over field values as a result.
- Removed redundant state from CDataGatherer. For example, we have access to the partition field name via the search key, so we don't need a copy in this class as well. (Note that this also reduces the number of parameters to the constructor, which had quite a lot of fallout on the unit tests.)

As a result we now have accurate control of the true memory (I measured consistently within 5% in a unit test with a variety of realistic data characteristics).
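The halving described above is what you get under one plausible per-reference accounting scheme for shared objects: each owner reports the object's size divided by the use count, so the total is only correct if every owner is visited. A sketch with hypothetical names (`Gatherer`, `accountedSize`), not the ml-cpp code:

```cpp
#include <cstddef>
#include <memory>

// Hypothetical shared object with a fixed dynamic size, standing in for the
// data gatherer shared by the model and the detector.
struct Gatherer {
    std::size_t dynamicSize() const { return 1000; }
};

// Per-reference accounting: each owner charges size / use_count. If only one
// of two owners is visited, the object's accounted memory is halved.
std::size_t accountedSize(const std::shared_ptr<Gatherer>& ref) {
    return ref->dynamicSize() / static_cast<std::size_t>(ref.use_count());
}
```

With two live owners, a single call to `accountedSize` reports only half the gatherer's 1000 bytes; summing over both owners recovers the full size, which is the shape of the fix described above.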
Note I factored out the test changes, which are mainly fallout from the CDataGatherer constructor signature change and some tidy-ups, from the functional code changes. This affects results for memory limited jobs only.