[ML] Performance improvements for categorization jobs by edsavage · Pull Request #89824 · elastic/elasticsearch

edsavage · 2022-09-06T15:06:10Z

Categorization of strings which break down to a huge number of tokens can cause the C++ backend process to choke - see elastic/ml-cpp#2403.

This PR adds a limit filter to the default categorization analyzer which caps the number of tokens passed to the backend at 100.

Unfortunately, this does not fix every issue with categorizing very large messages or messages with many tokens: verification checks on the frontend can still fail because calls to the datafeed _preview API return an excessive amount of data.
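For illustration only, the effect of a token-count cap can be sketched in plain Java. This is not the analyzer code changed by the PR: the class `TokenCapSketch` and the method `capTokens` are hypothetical names, and the sketch assumes the cap simply keeps the leading tokens and drops the rest.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Elasticsearch: illustrates a token-count cap. */
final class TokenCapSketch {

    // Cap value taken from the PR description.
    static final int MAX_TOKEN_COUNT = 100;

    /** Returns at most maxTokens leading tokens; any later tokens are dropped. */
    static List<String> capTokens(List<String> tokens, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must be non-negative");
        }
        int end = Math.min(tokens.size(), maxTokens);
        // Copy the prefix so the result does not keep a view onto a possibly huge input list.
        return new ArrayList<>(tokens.subList(0, end));
    }
}
```

With a cap like this, the work handed to the backend per message is bounded by the cap rather than by the size of the message, which is the property the PR relies on.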

elasticsearchmachine · 2022-09-06T15:06:36Z

Pinging @elastic/ml-core (Team:ML)

droberts195 · 2022-09-06T15:12:46Z

Also, the raw categorization field passed to the backend is truncated at 1000 characters.

Please split this part out into a separate PR so that it can be backported to 8.4 and 7.17 and it's clear what got backported and what didn't.

Also, I realised that we should truncate at 1001 characters on the Java side to preserve the behaviour that we add an ellipsis to the truncated example if truncation is necessary.
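A minimal sketch of why 1001 rather than 1000 matters, under stated assumptions: the receiving side is assumed to decide whether to add an ellipsis by checking whether the value it receives is longer than 1000 characters. The class and method names below are hypothetical and the receiver logic is a stand-in, not the actual backend code.

```java
/** Hypothetical sketch, not the Elasticsearch or ml-cpp implementation. */
final class TruncationSketch {

    // Limit mentioned in the thread for the raw categorization field.
    static final int MAX_EXAMPLE_LENGTH = 1000;
    // One extra character, as suggested in the review comment.
    static final int JAVA_SIDE_LIMIT = MAX_EXAMPLE_LENGTH + 1;

    /** Sender side: keep one character beyond the limit so truncation stays detectable. */
    static String truncateForBackend(String field) {
        // Note: substring counts UTF-16 code units and could split a surrogate pair.
        return field.length() <= JAVA_SIDE_LIMIT ? field : field.substring(0, JAVA_SIDE_LIMIT);
    }

    /** Receiver side (assumed behaviour): shorten to the limit and mark with an ellipsis. */
    static String toExample(String received) {
        if (received.length() <= MAX_EXAMPLE_LENGTH) {
            return received; // nothing was cut, so no ellipsis
        }
        return received.substring(0, MAX_EXAMPLE_LENGTH) + "...";
    }
}
```

If the sender cut at exactly 1000 characters, a receiver written like this could not tell a truncated value from one that was 1000 characters long to begin with, and the ellipsis would never be added.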

Split out the cap on the categorization field length to a separate PR. This PR adds a limit filter to the ml_standard tokenizer which caps the number of tokens passed to the backend at 20.

droberts195 · 2022-09-06T17:06:52Z

...e/src/main/java/org/elasticsearch/xpack/core/ml/job/config/CategorizationAnalyzerConfig.java

    public static final ParseField TOKEN_FILTERS = AnalyzeAction.Fields.TOKEN_FILTERS;
    public static final ParseField CHAR_FILTERS = AnalyzeAction.Fields.CHAR_FILTERS;

+    public static final int MAX_TOKEN_COUNT = 20;


I think the limit should be 100 to match this:

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/TokenListCategorizer.java

Line 43 in 93bc2e3

public static final int MAX_TOKENS = 100;

Restricting it from unlimited to 20 in a single release is pretty radical and I don't think we understand all the consequences of doing that.

If 21-100 tokens make the C++ code too slow, then I think we need to profile it, figure out why, and see if we can optimise it.

Increase MAX_TOKEN_COUNT to 100

droberts195 · 2022-09-08T08:18:17Z

There are a couple of YAML tests that need updating to reflect the change:

ml/jobs_crud/Test update job
reference/ml/common/apis/get-ml-info:15 (in the docs)

droberts195

LGTM

edsavage · 2022-09-08T16:00:27Z

@elasticmachine update branch

edsavage · 2022-09-08T16:41:27Z

@elasticmachine update branch

edsavage added the >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), and v8.5.0 labels on Sep 6, 2022

[ML] Performance improvements for categorization jobs

3f87eda

Split out the cap on the categorization field length to a separate PR. This PR adds a limit filter to the ml_standard tokenizer which caps the number of tokens passed to the backend at 20.

edsavage force-pushed the categorization_token_limit branch from 07d8d23 to 3f87eda on September 6, 2022 15:49

droberts195 reviewed Sep 6, 2022

Attend to code review comments

8e8ace6

Increase MAX_TOKEN_COUNT to 100

edsavage added 2 commits September 8, 2022 10:52

Fix failing yaml rest tests

4e60aec

Further fixes for failing tests

2235464

droberts195 approved these changes Sep 8, 2022

Merge branch 'main' into categorization_token_limit

867275b

Merge branch 'main' into categorization_token_limit

9999fb5

edsavage merged commit fd20027 into elastic:main Sep 8, 2022

edsavage deleted the categorization_token_limit branch September 9, 2022 08:05

edsavage merged 6 commits into elastic:main from edsavage:categorization_token_limit
