[ML] Performance improvements for categorization jobs #89824

Merged
edsavage merged 6 commits into elastic:main from edsavage:categorization_token_limit on Sep 8, 2022

@edsavage edsavage commented Sep 6, 2022

Categorization of strings which break down to a huge number of tokens can cause the C++ backend process to choke - see elastic/ml-cpp#2403.

This PR adds a limit filter to the default categorization analyzer which caps the number of tokens passed to the backend at 100.

Unfortunately, this isn't a complete fix for all the issues surrounding categorization of large messages with many tokens: verification checks on the frontend can also fail because calls to the datafeed _preview API return an excessive amount of data.
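
As a rough illustration of the token cap described above, here is a minimal sketch in plain Java. It is not the code in this PR: the whitespace tokenizer and the `limitTokens` helper are hypothetical stand-ins for the real analyzer chain, and only the cap of 100 and the constant name `MAX_TOKEN_COUNT` (which appears in the diff under review) come from this conversation.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLimitSketch {
    // Cap taken from the PR description; the constant name mirrors the diff under review.
    public static final int MAX_TOKEN_COUNT = 100;

    // Hypothetical helper: a naive whitespace tokenizer standing in for the real analyzer.
    public static List<String> tokenize(String message) {
        List<String> tokens = new ArrayList<>();
        for (String token : message.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Hypothetical helper: keep at most maxTokenCount tokens and drop the rest.
    public static List<String> limitTokens(List<String> tokens, int maxTokenCount) {
        if (tokens.size() <= maxTokenCount) {
            return tokens;
        }
        return new ArrayList<>(tokens.subList(0, maxTokenCount));
    }

    public static void main(String[] args) {
        // Build a message that breaks down into 250 tokens.
        StringBuilder message = new StringBuilder();
        for (int i = 0; i < 250; i++) {
            message.append("tok").append(i).append(' ');
        }
        List<String> capped = limitTokens(tokenize(message.toString()), MAX_TOKEN_COUNT);
        System.out.println(capped.size()); // prints 100
    }
}
```

The point of the sketch is only that everything after the first 100 tokens never reaches the backend, so the cost of categorizing a message is bounded regardless of its length.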

@edsavage edsavage added the >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), and v8.5.0 labels Sep 6, 2022
@elasticsearchmachine

Pinging @elastic/ml-core (Team:ML)

@droberts195

Also, the raw categorization field passed to the backend is truncated at 1000 characters.

Please split this part out into a separate PR so that it can be backported to 8.4 and 7.17 and it's clear what got backported and what didn't.

Also, I realised that we should truncate at 1001 characters on the Java side to preserve the behaviour that we add an ellipsis to the truncated example if truncation is necessary.
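
The reasoning behind 1001 rather than 1000 can be sketched as follows. This is an illustration under stated assumptions, not the implementation: `truncateForBackend` and `backendExample` are hypothetical names, the backend behaviour is simulated from the description in this thread (truncate at 1000 characters and mark a truncated example with an ellipsis), the ellipsis is assumed to be three ASCII dots, and "characters" is taken to mean Java `char` units.

```java
public class TruncationSketch {
    // Limit the backend is described as applying to the raw categorization field.
    public static final int BACKEND_LIMIT = 1000;
    // One extra character so the receiver can still tell that truncation was needed.
    public static final int JAVA_SIDE_LIMIT = BACKEND_LIMIT + 1;

    // Hypothetical Java-side helper: never sends more than 1001 characters.
    public static String truncateForBackend(String value) {
        return value.length() <= JAVA_SIDE_LIMIT ? value : value.substring(0, JAVA_SIDE_LIMIT);
    }

    // Assumed stand-in for the backend: cut at 1000 characters and append an
    // ellipsis only when the received value was longer than that.
    public static String backendExample(String received) {
        if (received.length() <= BACKEND_LIMIT) {
            return received;
        }
        return received.substring(0, BACKEND_LIMIT) + "...";
    }

    public static void main(String[] args) {
        String longMessage = "x".repeat(1500);
        String sent = truncateForBackend(longMessage);
        System.out.println(sent.length());                        // prints 1001
        System.out.println(backendExample(sent).endsWith("...")); // prints true
    }
}
```

If the Java side cut at exactly 1000 characters, the simulated backend would receive a value that is not over its limit and would have no way to know an ellipsis is due; sending 1001 keeps that signal while still bounding the payload.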

Split out the cap on the categorization field length to a separate PR.

This PR adds a limit filter to the ml_standard tokenizer which caps the number of tokens passed to the backend at 20.
@edsavage edsavage force-pushed the categorization_token_limit branch from 07d8d23 to 3f87eda on September 6, 2022 15:49
public static final ParseField TOKEN_FILTERS = AnalyzeAction.Fields.TOKEN_FILTERS;
public static final ParseField CHAR_FILTERS = AnalyzeAction.Fields.CHAR_FILTERS;

public static final int MAX_TOKEN_COUNT = 20;

I think the limit should be 100 to match this:

Restricting it from unlimited to 20 in a single release is pretty radical and I don't think we understand all the consequences of doing that.

If 21-100 tokens makes the C++ too slow then I think we need to profile the C++ code and figure out why and see if we can optimise that code.

Increase MAX_TOKEN_COUNT to 100
@droberts195

There are a couple of YAML tests that need updating to reflect the change:

  1. ml/jobs_crud/Test update job
  2. reference/ml/common/apis/get-ml-info:15 (in the docs)

@droberts195 droberts195 left a comment

LGTM

edsavage commented Sep 8, 2022

@elasticmachine update branch

edsavage commented Sep 8, 2022

@elasticmachine update branch

@edsavage edsavage merged commit fd20027 into elastic:main Sep 8, 2022
@edsavage edsavage deleted the categorization_token_limit branch September 9, 2022 08:05
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (34 commits)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  Fix segment stats in tsdb (elastic#89754)
  Synthetic _source: support dense_vector (elastic#89840)
  REST tests fetching fields with synthetic _source (elastic#89888)
  Do not deserialize back BytesTransportRequest to clone a request in MockTransportService (elastic#89926)
  Add SDK request logging to debug failures of S3BlobStoreRepositoryTests#testRequestStats (elastic#89912)
  Fix SnapshotStatusApisIT.testGetSnapshotsWithSnapshotInProgress (elastic#89925)
  Document synthetic source for text and keyword (elastic#89893)
  Fix CloneSnapshotIT.testRemoveFailedCloneFromCSWithQueuedSnapshotInProgress (elastic#89914)
  Add missing index.mapping.total_fields.limit setting to the target index (elastic#89875)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (176 commits)
  Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958)
  [Downsampling] Replace document map with SMILE encoded doc (elastic#89495)
  Remove full cluster state from error logging in MasterService (elastic#89960)
  [ML] Truncate categorization fields (elastic#89827)
  [TSDB] Removed `summary` and `histogram` metric types (elastic#89937)
  Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879)
  Make sure listener is resolved when file queue is cleared (elastic#89929)
  [Stable plugin api] Extensible annotation (elastic#89903)
  Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  ...