I am running a categorize_text aggregation on a relatively small data set, and the response time is about 18s for 185k docs with three aggregations, so each aggregation processes about 30k docs/s. I briefly spoke to @droberts195 about this and we are not sure if this is expected. Each aggregation (verified by removing/adding them individually) seems to account for about a third of the response time.
In this case, the aggregations run over process.executable.text, user.name.text and host.os.name.text, which are match_only_text fields. Some observations:
- These are low cardinality fields - 1 or 2 values per field
- The aggregation returns no results
- Running the aggregation over the keyword siblings does return results, and takes about the same amount of time.
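For reference, the request is roughly shaped like this (a sketch, not the exact query; the index name is a placeholder and all `categorize_text` parameters are left at their defaults):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "categories_process": {
      "categorize_text": { "field": "process.executable.text" }
    },
    "categories_user": {
      "categorize_text": { "field": "user.name.text" }
    },
    "categories_host": {
      "categorize_text": { "field": "host.os.name.text" }
    }
  }
}
```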
My questions:
- is 30k docs per second per shard expected?
- can we speed it up (significantly)?
- is not returning any results expected?
- is performance affected by the cardinality of the data set, or only the document count (documents with values)? what other factors come into play?
- how do we determine whether results are statistically significant?
Context here is that I'm trying to figure out whether we can make the log rate analysis API real-time (i.e. instead of a double-digit-seconds response). The goal is to get statistically significant results within <=2.5s from the API, and I'm trying to figure out what options we have to get there, e.g. use a sampler agg and re-poll with a greater probability if the results are not significant, etc.
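To illustrate the sampler idea, this is the kind of request I have in mind (a sketch only; the index name and the starting `probability` value are assumptions, and we would re-issue the request with a higher probability if the sampled results are not significant):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "sampled": {
      "random_sampler": { "probability": 0.01 },
      "aggs": {
        "categories": {
          "categorize_text": { "field": "process.executable.text" }
        }
      }
    }
  }
}
```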