[ML] Performance improvements for categorization jobs#89824
edsavage merged 6 commits into elastic:main
Pinging @elastic/ml-core (Team:ML)
Please split this part out into a separate PR so that it can be backported to 8.4 and 7.17 and it's clear what got backported and what didn't. Also, I realised that we should truncate at 1001 characters on the Java side to preserve the behaviour that we add an ellipsis to the truncated example if truncation is necessary.
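To make the reasoning behind the 1001-character cut-off concrete, here is a minimal sketch. It assumes (this is inferred from the comment above, not confirmed by it) that the code storing examples keeps at most 1000 characters and appends an ellipsis only when it receives something longer. The class name, method names, constants, and the exact ellipsis string are all hypothetical.

```java
/** Hypothetical illustration only; not the Elasticsearch or ml-cpp implementation. */
class TruncationSketch {
    // Assumption: stored examples keep at most this many characters.
    static final int ASSUMED_EXAMPLE_LIMIT = 1000;
    // One character more than the assumed limit, so the receiver can still
    // tell that the original value was longer than it is willing to store.
    static final int JAVA_SIDE_LIMIT = ASSUMED_EXAMPLE_LIMIT + 1;

    /** Java-side cap applied before the value is sent on. */
    static String truncateForBackend(String value) {
        // Note: substring counts UTF-16 code units, so this sketch may split a surrogate pair.
        return value.length() <= JAVA_SIDE_LIMIT ? value : value.substring(0, JAVA_SIDE_LIMIT);
    }

    /** Assumed receiver behaviour: truncate and mark with an ellipsis if too long. */
    static String storeExample(String received) {
        if (received.length() <= ASSUMED_EXAMPLE_LIMIT) {
            return received;
        }
        return received.substring(0, ASSUMED_EXAMPLE_LIMIT) + "...";
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1500; i++) {
            sb.append('a');
        }
        String sent = truncateForBackend(sb.toString());
        String stored = storeExample(sent);
        System.out.println(sent.length() + " -> " + stored.length());
    }
}
```

Under these assumptions, cutting at exactly 1000 characters on the Java side would leave the receiver unable to distinguish a truncated value from one that was 1000 characters long to begin with, so no ellipsis would be added.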
Split out the cap on the categorization field length to a separate PR. This PR adds a limit filter to the ml_standard tokenizer which caps the number of tokens passed to the backend at 20.
07d8d23 to 3f87eda
public static final ParseField TOKEN_FILTERS = AnalyzeAction.Fields.TOKEN_FILTERS;
public static final ParseField CHAR_FILTERS = AnalyzeAction.Fields.CHAR_FILTERS;

public static final int MAX_TOKEN_COUNT = 20;
I think the limit should be 100 to match this:
Restricting it from unlimited to 20 in a single release is pretty radical and I don't think we understand all the consequences of doing that.
If 21-100 tokens makes the C++ too slow then I think we need to profile the C++ code and figure out why and see if we can optimise that code.
Increase MAX_TOKEN_COUNT to 100
There are a couple of YAML tests that need updating to reflect the change:
@elasticmachine update branch
@elasticmachine update branch
* main: (34 commits)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  Fix segment stats in tsdb (elastic#89754)
  Synthetic _source: support dense_vector (elastic#89840)
  REST tests fetching fields with synthetic _source (elastic#89888)
  Do not deserialize back BytesTransportRequest to clone a request in MockTransportService (elastic#89926)
  Add SDK request logging to debug failures of S3BlobStoreRepositoryTests#testRequestStats (elastic#89912)
  Fix SnapshotStatusApisIT.testGetSnapshotsWithSnapshotInProgress (elastic#89925)
  Document synthetic source for text and keyword (elastic#89893)
  Fix CloneSnapshotIT.testRemoveFailedCloneFromCSWithQueuedSnapshotInProgress (elastic#89914)
  Add missing index.mapping.total_fields.limit setting to the target index (elastic#89875)
  ...
* main: (176 commits)
  Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958)
  [Downsampling] Replace document map with SMILE encoded doc (elastic#89495)
  Remove full cluster state from error logging in MasterService (elastic#89960)
  [ML] Truncate categorization fields (elastic#89827)
  [TSDB] Removed `summary` and `histogram` metric types (elastic#89937)
  Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879)
  Make sure listener is resolved when file queue is cleared (elastic#89929)
  [Stable plugin api] Extensible annotation (elastic#89903)
  Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  ...
Categorization of strings that break down into a huge number of tokens can cause the C++ backend process to choke; see elastic/ml-cpp#2403.
This PR adds to the default categorization analyzer a limit filter that caps the number of tokens passed to the backend at 100.
Unfortunately, this does not resolve every issue with categorizing very long or many-token messages: verification checks on the frontend can still fail because calls to the datafeed _preview API return an excessive amount of data.
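As a rough illustration of what capping the token count means, the sketch below keeps only the first `MAX_TOKEN_COUNT` tokens of an already-tokenized message and discards the rest. It is a plain-Java stand-in, not the analyzer code from this PR: the class and method names are hypothetical, and the real change is a token filter in the analysis chain rather than a list operation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Elasticsearch implementation. */
class TokenLimitSketch {
    // Mirrors the value this PR settled on after review.
    static final int MAX_TOKEN_COUNT = 100;

    /** Returns at most maxTokens leading tokens; later tokens are dropped. */
    static List<String> limitTokens(List<String> tokens, int maxTokens) {
        if (tokens.size() <= maxTokens) {
            return tokens;
        }
        // Copy so the result does not keep a view onto the full input list.
        return new ArrayList<>(tokens.subList(0, maxTokens));
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            tokens.add("token" + i);
        }
        List<String> limited = limitTokens(tokens, MAX_TOKEN_COUNT);
        System.out.println(limited.size() + " of " + tokens.size() + " tokens kept");
    }
}
```

Messages with 100 or fewer tokens pass through unchanged in this sketch, so only unusually long messages lose information.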