[ML] Use a new ML endpoint to estimate a model memory #60376
darnautov merged 18 commits into elastic:master from
Conversation
Pinging @elastic/ml-ui (:ml)
droberts195 left a comment
The overall structure looks fine to me.
I left a few comments about how the ES endpoint is being called. (Since I don't know anything about JavaScript or TypeScript someone else needs to review from that perspective.)
x-pack/plugins/ml/server/models/calculate_model_memory_limit/calculate_model_memory_limit.ts
droberts195 left a comment
Thanks for the updates Dima. I cannot see anything wrong with this now, but please wait for a review from someone who knows the Kibana code well before merging.
x-pack/plugins/ml/server/routes/schemas/anomaly_detectors_schema.ts
x-pack/plugins/ml/server/models/job_validation/validate_model_memory_limit.test.ts
x-pack/plugins/ml/server/models/job_validation/validate_model_memory_limit.ts
```typescript
  indexPattern: string,
  query: any,
  timeFieldName: string,
  earliestMs: number,
```
Just a note that in the follow-up bit of work, where the modules setup endpoint calls this calculateModelMemoryLimit function, that we need to handle the case where earliest and latest times are not supplied. In this case, the setup endpoint should find the start/end times of the latest 3 months of data, and pass those times to this function.
Thanks for the heads-up. Yes, I planned these changes for the follow-up PR dedicated to the setup endpoint.
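The fallback described above could be sketched roughly as follows. This is a hypothetical helper, not the PR's code: `resolveTimeRange` and `latestDataTimestampMs` are illustrative names, and the "latest 3 months" window is approximated as 90 days.

```typescript
// Hypothetical sketch: when the modules setup endpoint does not receive
// earliest/latest times, fall back to the latest ~3 months of data.
// `latestDataTimestampMs` is assumed to come from a max-timestamp query
// against the index.
const THREE_MONTHS_MS = 90 * 24 * 60 * 60 * 1000;

function resolveTimeRange(
  earliestMs: number | undefined,
  latestMs: number | undefined,
  latestDataTimestampMs: number
): { earliestMs: number; latestMs: number } {
  // Prefer caller-supplied bounds; otherwise anchor the window to the
  // newest document in the index.
  const resolvedLatest = latestMs ?? latestDataTimestampMs;
  const resolvedEarliest = earliestMs ?? resolvedLatest - THREE_MONTHS_MS;
  return { earliestMs: resolvedEarliest, latestMs: resolvedLatest };
}
```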
I tested the multi metric wizard with the time range set to one where there is no data in the index, and the call to the endpoint failed. This looks like it happens when you add an influencer which isn't the partitioning field.
Good catch, fixed in 7bdf515.
I'm seeing errors when trying to calculate the mml when I have a max mml set in the ES config.
Thanks for checking this. I used
```typescript
/**
 * Service for carrying out queries to obtain data
 * specific to fields in Elasticsearch indices.
 */
export function fieldsServiceProvider(callAsCurrentUser: APICaller) {
```
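For context, the kind of aggregation such a fields service might build for a per-bucket cardinality check could look like the sketch below. This is not the PR's implementation: `buildMaxBucketCardinalityAggs` and the `@timestamp` field are assumptions for illustration. It pairs a `date_histogram` with a `cardinality` sub-aggregation, then uses a `max_bucket` pipeline aggregation to pick the highest per-bucket cardinality.

```typescript
// Hypothetical sketch of a max-bucket-cardinality aggregation body:
// count distinct values of `fieldName` per time bucket, then take the
// maximum across buckets via a max_bucket pipeline aggregation.
function buildMaxBucketCardinalityAggs(fieldName: string, bucketSpanMs: number) {
  return {
    bucket_span_buckets: {
      date_histogram: {
        field: '@timestamp', // assumed time field
        fixed_interval: `${bucketSpanMs}ms`,
      },
      aggs: {
        field_cardinality: { cardinality: { field: fieldName } },
      },
    },
    max_bucket_cardinality: {
      max_bucket: { buckets_path: 'bucket_span_buckets>field_cardinality' },
    },
  };
}
```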
Building a multi-metric job against the gallery data set, using the full time range, and a small bucket span, I am seeing circuit_breaking_exceptions in the network tab of the dev console as I add extra influencers.
```
{"statusCode":429,"error":"Too Many Requests","message":"[circuit_breaking_exception] [parent] Data too large, data for [<reused_arrays>] would be [988102840/942.3mb], which is larger than the limit of [986932838/941.2mb], real usage: [988102768/942.3mb], new bytes reserved: [72/72b], usages [request=147475312/140.6mb, fielddata=228074/222.7kb, accounting=1282468/1.2mb, inflight_requests=784/784b], with { bytes_wanted=988102840 & bytes_limit=986932838 & durability=\"TRANSIENT\" } (and) [circuit_breaking_exception]...
```
We need to try to optimize the queries being done in here. There are probably two factors making it use a lot of memory:
- It’s doing all the fields in one query
- It’s using the full time range
If the bucket span is small and the time range long, then that’s a lot of time buckets e.g. 15 minute bucket span, 12 month time range.
So we probably need to be pragmatic about capping the number of buckets that we take the max over, say the last 1000 buckets. That should be easy to calculate knowing the bucket span and time range provided: just adjust earliestMs to be max(earliestMs, latestMs - 1000 * bucketSpanMs) (with the bucket span converted to milliseconds).
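The adjustment described above could be sketched as follows. This is a minimal illustration, not the PR's code; `capEarliestMs` is a hypothetical name.

```typescript
// Sketch of the suggested cap: limit the aggregation to the last
// `maxBuckets` buckets, so a small bucket span over a long time range
// cannot produce an unbounded number of time buckets.
const MAX_BUCKETS = 1000;

function capEarliestMs(
  earliestMs: number,
  latestMs: number,
  bucketSpanMs: number,
  maxBuckets: number = MAX_BUCKETS
): number {
  // If the requested range spans fewer than maxBuckets buckets, it is
  // returned unchanged; otherwise the start is pulled forward.
  return Math.max(earliestMs, latestMs - maxBuckets * bucketSpanMs);
}
```

For example, with a 15-minute bucket span (900000 ms), a 12-month range would be trimmed to the last 1000 buckets, roughly 10.4 days.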
As well as being defensive in the queries, you should probably also expose to the user that an error has occurred calling the estimate model memory limit endpoint e.g. in a toast notification. Currently you only see the error in the Kibana server console, or by looking in the browser network tab.
Thanks for testing! For now, I added capping of the time range and replaced the must part of the query with filter to take advantage of caching; check 7ee91e7.
In a follow-up PR, I'll try to split the current ES query into multiple queries (in case plenty of influencers are provided) and check how it performs.
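The must-to-filter change mentioned above can be illustrated with a minimal bool query. This is not the PR's exact query; `buildCardinalityQuery` is a hypothetical helper. Clauses that don't need relevance scoring belong in `filter`, where Elasticsearch can skip scoring and cache the clause results across the repeated cardinality requests.

```typescript
// Illustrative only: a time-range clause placed in `filter` (rather
// than `must`) runs in filter context, so it is cacheable and not scored.
function buildCardinalityQuery(timeFieldName: string, earliestMs: number, latestMs: number) {
  return {
    bool: {
      filter: [
        {
          range: {
            [timeFieldName]: { gte: earliestMs, lte: latestMs, format: 'epoch_millis' },
          },
        },
      ],
    },
  };
}
```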
@elasticmachine merge upstream
peteharverson left a comment
LGTM. Capping to a max of 1000 buckets has cleared the circuit_breaking_exceptions I was seeing on the gallery data before. It would be good to expose any errors from the endpoint to the user in the follow-up.
💚 Build Succeeded
* [ML] refactor calculate_model_memory_limit route, use estimateModelMemory endpoint
* [ML] refactor validate_model_memory_limit, migrate tests to jest
* [ML] fix typing issue
* [ML] start estimateModelMemory url with /
* [ML] fix typo, filter mlcategory
* [ML] extract getCardinalities function
* [ML] fields_service.ts
* [ML] wip getMaxBucketCardinality
* [ML] refactor and comments
* [ML] fix aggs keys with special characters, fix integration tests
* [ML] use pre-defined job types
* [ML] fallback to 0 in case max bucket cardinality receives null
* [ML] calculateModelMemoryLimit on influencers change
* [ML] fix maxModelMemoryLimit
* [ML] cap aggregation to max 1000 buckets
* [ML] rename intervalDuration

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* master:
  * [ML] Use a new ML endpoint to estimate a model memory (elastic#60376)
  * [Logs UI] Correctly update the expanded log rate table rows (elastic#60306)
  * fixes drag and drop flakiness (elastic#60625)
  * Removing isEmptyState from embeddable input (elastic#60511)
  * [Cross Cluster Replication] NP Shim (elastic#60121)
  * Clear changes when canceling an edit to an alert (elastic#60518)
  * Update workflow syntax (elastic#60626)
  * Updating project assigner workflows to v2.0.0 of the action and back to default tokens (elastic#60577)
  * migrate saved objects management edition view to react/typescript/eui (elastic#59490)
Summary
Part of #60386.

* Changed the `calculate_model_memory_limit` Kibana endpoint to use the ML `_estimate_model_memory` API to retrieve a model memory estimation.
* Updated `validate_model_memory_limit` to work with the new `calculate_model_memory_limit`.

Checklist