I am running a categorize_text aggregation on a relatively small data set, and the response time is about 18s for 185k docs with three aggregations, so each aggregation processes about 30k docs/s. I briefly spoke to @droberts195 about this and we are not sure if this is expected. Each aggregation (verified by removing/adding them individually) seems to account for about a third of the response time.
In this case, the aggregations run over process.executable.text, user.name.text and host.os.name.text, which are match_only_text fields. Some observations:
- These are low cardinality fields - 1 or 2 values per field
- The aggregation returns no results
- Running the aggregation over the keyword siblings does return results, and takes about the same amount of time.
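For reference, the request is roughly shaped like this (a sketch, not the exact query; the index name is a placeholder and all `categorize_text` parameters are left at their defaults):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "categories_process": {
      "categorize_text": { "field": "process.executable.text" }
    },
    "categories_user": {
      "categorize_text": { "field": "user.name.text" }
    },
    "categories_host": {
      "categorize_text": { "field": "host.os.name.text" }
    }
  }
}
```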
My questions:
- is 30k docs per second per shard expected?
- can we speed it up (significantly)?
- is not returning any results expected?
- is performance affected by the cardinality of the data set, or only the document count (documents with values)? what other factors come into play?
- how do we determine whether results are statistically significant?
Context here is that I'm trying to figure out whether we can make the log rate analysis API real-time (i.e. instead of a double-digit-seconds response). The goal is to get statistically significant results within <=2.5s from the API, and I'm trying to figure out what options we have to get there, e.g. use a sampler agg and re-poll with a greater probability if the results are not significant, etc.
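To illustrate the sampler idea, this is the kind of request I have in mind (a sketch only; the index name and the starting `probability` value are assumptions, and we would re-issue the request with a higher probability if the sampled results are not significant):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "sampled": {
      "random_sampler": { "probability": 0.01 },
      "aggs": {
        "categories": {
          "categorize_text": { "field": "process.executable.text" }
        }
      }
    }
  }
}
```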