Calcite patterns command brain pattern method #3570
Signed-off-by: Songkan Tang <songkant@amazon.com>
docs/user/ppl/cmd/patterns.rst
Outdated
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be simple_pattern. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
There was a problem hiding this comment.
If it is optional, add the default value in the doc.
There was a problem hiding this comment.
Done. Added default value
docs/user/ppl/cmd/patterns.rst
Outdated
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
docs/user/ppl/cmd/patterns.rst
Outdated
* pattern_max_sample_count: optional. The max sample logs to be returned per pattern in aggregation mode.
* pattern_buffer_limit: optional. This is a special safeguard parameter for BRAIN algorithm to limit internal temporary buffer to hold processed logs.
There was a problem hiding this comment.
If it is optional, add the default value in the doc.
There was a problem hiding this comment.
Done. Added default value.
docs/user/ppl/cmd/patterns.rst
Outdated
or

patterns [new_field=<new-field-name>] [pattern=<pattern>] <field> SIMPLE_PATTERN
patterns <field> [by byClause...] pattern_method=SIMPLE_PATTERN [pattern_mode=LABEL | AGGREGATION] [pattern_max_sample_count=integer] [new_field=<new-field-name>] [pattern=<pattern>]
There was a problem hiding this comment.
nit: all options are under the patterns command, so we can simplify them, for instance pattern_method -> method, pattern_mode -> mode
There was a problem hiding this comment.
Renamed those options
docs/user/ppl/cmd/patterns.rst
Outdated

* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
There was a problem hiding this comment.
brain or BRAIN, is it case-sensitive?
There was a problem hiding this comment.
I think it should be case-insensitive. @songkant-aws please double check. I suggest using lower case in the syntax doc.
There was a problem hiding this comment.
It's case-insensitive. The values are now all lower case in the syntax doc.
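For readers wondering what case-insensitive handling of the method value might look like, here is a minimal sketch. The class and enum names (`PatternMethodDemo`, `PatternMethod`) are hypothetical and are not taken from this PR; the sketch only illustrates one common way to normalize a user-supplied value before mapping it to an enum.

```java
import java.util.Locale;

/** Hypothetical sketch; these are not the plugin's actual classes. */
final class PatternMethodDemo {

  /** Illustrative enum whose constants mirror the values named in the syntax doc. */
  enum PatternMethod {
    SIMPLE_PATTERN,
    BRAIN;

    /** Parses a user-supplied value without regard to case. */
    static PatternMethod parse(String raw) {
      if (raw == null || raw.trim().isEmpty()) {
        throw new IllegalArgumentException("pattern method must not be empty");
      }
      // Locale.ROOT avoids locale-specific case mappings (for example the Turkish dotless i).
      String normalized = raw.trim().toUpperCase(Locale.ROOT);
      try {
        return PatternMethod.valueOf(normalized);
      } catch (IllegalArgumentException e) {
        throw new IllegalArgumentException("Unsupported pattern method: " + raw, e);
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(PatternMethod.parse("brain"));          // prints BRAIN
    System.out.println(PatternMethod.parse("Simple_Pattern")); // prints SIMPLE_PATTERN
  }
}
```

Normalizing with `Locale.ROOT` keeps the result independent of the JVM's default locale, which matters when option values contain the letter "i".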
docs/user/ppl/cmd/patterns.rst
Outdated

* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
There was a problem hiding this comment.
By default, it's simple_pattern if the setting plugins.ppl.default.pattern.method is not specified.
-> The default value is configured by the setting plugins.ppl.default.pattern.method.
docs/user/ppl/cmd/patterns.rst
Outdated
* field: mandatory. The field must be a text field.
* byClause: optional. The log groups to be labeled or aggregated. It could be fields and scalar functions.
* pattern_method: optional. Specify pattern method to be brain. By default, it's simple_pattern if the setting ``plugins.ppl.default.pattern.method`` is not specified.
* pattern_mode: optional. label mode or aggregation mode. Default is label mode.
There was a problem hiding this comment.
ditto, The default value is configured by the setting?
| +-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
| | patterns_field | pattern_count | sample_logs | | ||
| |-------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | <*IP*> - <*> [<*>/Sep/<*>:<*>:<*>:<*> <*>] <*> <*> HTTP/<*><*>" <*> <*> | 4 | [177.95.8.74 - upton5450 [28/Sep/2022:10:15:57 -0700] "HEAD /e-business/mindshare HTTP/1.0" 404 19927,127.45.152.6 - pouros8756 [28/Sep/2022:10:15:57 -0700] "GET /architectures/convergence/niches/mindshare HTTP/1.0" 100 28722,118.223.210.105 - - [28/Sep/2022:10:15:57 -0700] "PATCH /strategize/out-of-the-box HTTP/1.0" 401 27439,210.204.15.104 - - [28/Sep/2022:10:15:57 -0700] "POST /users HTTP/1.1" 301 9481] | |
There was a problem hiding this comment.
- In aggregation mode, does the pattern command collect sample values of IP addresses?
- What is the output syntax of the pattern_fields? Can it be used directly in search queries?
There was a problem hiding this comment.
- In the IT, the detected pattern is PacketResponder failed <token1> blk_<token2>. What does IP mean? Is it a token?
There was a problem hiding this comment.
- Yes, when Calcite is enabled, the sample logs are converted to sample tokens at their respective positions.
- patterns_field will be a string in a format like PacketResponder failed <token1> blk_<token2>, and tokens will be a map like {token1: [...], token2: [...]}. I'm not sure what is meant by using them directly in search queries.
- <IP> is one of the variable placeholders in BrainLogParser's output. Yes, it's a token in the V2 output format.
I have updated the syntax docs with more examples. When Calcite is enabled, the output is a pattern string with <token*> placeholders plus a map of the corresponding tokens. Users can leverage those two output columns for further queries.
There was a problem hiding this comment.
Not sure what does it mean by using them directly in search queries?
For instance, as a user, given the detected pattern <token1> - <token2> [<token3>/Sep/<token4>:<token5>:<token6>:<token7> <token8>] <token9> <token10> HTTP/<token11><token12>\" <token13> <token14>, I want to search for logs that match this pattern.
I think it cannot be used directly; the query would need to be revised. That is out of scope for this PR.
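To make the use case above concrete, here is a rough client-side sketch of how a pattern string with <tokenN> placeholders could be turned into a regular expression for matching logs and pulling out token values. This is not part of the PR and not a PPL feature. The class name `TokenPatternMatcher`, the choice of `\S+?` for each placeholder, and the sample log line are all assumptions made for illustration, and the sketch assumes each placeholder name appears only once in the pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; illustrates the idea only, not the plugin's behavior. */
final class TokenPatternMatcher {
  // Matches placeholders such as <token1>, <token2>, ...
  private static final Pattern PLACEHOLDER = Pattern.compile("<(token\\d+)>");

  /** Builds a regex in which literal text is quoted and each placeholder is a named group. */
  static Pattern toRegex(String patternField) {
    StringBuilder regex = new StringBuilder();
    Matcher placeholder = PLACEHOLDER.matcher(patternField);
    int literalStart = 0;
    while (placeholder.find()) {
      regex.append(Pattern.quote(patternField.substring(literalStart, placeholder.start())));
      // Assumption: a token is a run of non-whitespace characters.
      regex.append("(?<").append(placeholder.group(1)).append(">\\S+?)");
      literalStart = placeholder.end();
    }
    regex.append(Pattern.quote(patternField.substring(literalStart)));
    return Pattern.compile(regex.toString());
  }

  /** Returns the token values for one log line, or empty if the line does not match. */
  static Optional<Map<String, String>> extractTokens(String patternField, String logLine) {
    Matcher match = toRegex(patternField).matcher(logLine);
    if (!match.matches()) {
      return Optional.empty();
    }
    Map<String, String> tokens = new LinkedHashMap<>();
    Matcher placeholder = PLACEHOLDER.matcher(patternField);
    while (placeholder.find()) {
      String name = placeholder.group(1);
      tokens.put(name, match.group(name));
    }
    return Optional.of(tokens);
  }

  public static void main(String[] args) {
    String pattern = "PacketResponder failed <token1> blk_<token2>";
    // The log lines below are made up for the example.
    System.out.println(extractTokens(pattern, "PacketResponder failed for blk_123"));
    System.out.println(extractTokens(pattern, "Received block blk_123"));
  }
}
```

Unlike the tokens map described in the thread, which collects a list of values per token across many logs, this sketch extracts a single value per token from one log line.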
LantaoJin
left a comment
There was a problem hiding this comment.
Please update the code base and address the latest comments.
PATTERN_MODE: 'PATTERN_MODE';
PATTERN_METHOD: 'PATTERN_METHOD';
PATTERN_MAX_SAMPLE_COUNT: 'PATTERN_MAX_SAMPLE_COUNT';
PATTERN_BUFFER_LIMIT: 'PATTERN_BUFFER_LIMIT';
There was a problem hiding this comment.
How about simplifying the arguments by removing the pattern_ prefix?
There was a problem hiding this comment.
Removed pattern_ prefix
DEFAULT_PATTERN_METHOD("plugins.ppl.default.pattern.method"),
DEFAULT_PATTERN_MODE("plugins.ppl.default.pattern.mode"),
DEFAULT_PATTERN_MAX_SAMPLE_COUNT("plugins.ppl.default.pattern.max.sample.count"),
DEFAULT_PATTERN_BUFFER_LIMIT("plugins.ppl.default.pattern.buffer.limit"),
There was a problem hiding this comment.
Can we remove the default part? It's meaningless IMO.
plugins.ppl.default.pattern.method -> plugins.ppl.pattern.method
There was a problem hiding this comment.
Removed default part
* max_sample_count: optional. Max sample logs returned per pattern in aggregation mode (default: 10). The max_sample_count is configured by the setting ``plugins.ppl.pattern.max.sample.count``.
* buffer_limit: optional. Safeguard parameter for ``brain`` algorithm to limit internal temporary buffer size (default: 100,000, min: 50,000). The buffer_limit is configured by the setting ``plugins.ppl.pattern.buffer.limit``.
* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
There was a problem hiding this comment.
nit, this line is in italic format, is it expected? https://github.com/opensearch-project/sql/blob/8f468a5b92b4b3a2ca5fb2798f78f329c88b4581/docs/user/ppl/cmd/patterns.rst#syntax
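As a side note on the buffer_limit line in the hunk above (default 100,000, minimum 50,000, default configurable through ``plugins.ppl.pattern.buffer.limit``), the resolution order could be sketched roughly as below. The class name `BufferLimitResolver`, the precedence of command argument over cluster setting, and the choice to reject values below the minimum rather than clamp them are all assumptions for illustration. The thread does not say how the plugin actually handles these cases.

```java
/** Hypothetical sketch of option resolution; not the plugin's actual code. */
final class BufferLimitResolver {
  // Values taken from the doc line quoted in the hunk above.
  static final int BUILT_IN_DEFAULT = 100_000;
  static final int MINIMUM = 50_000;

  /**
   * Resolves the effective buffer limit.
   *
   * @param commandArg value given in the query, or null if absent
   * @param clusterSetting value of the cluster setting, or null if unset
   */
  static int resolve(Integer commandArg, Integer clusterSetting) {
    int value;
    if (commandArg != null) {
      value = commandArg;             // assumption: an explicit argument wins
    } else if (clusterSetting != null) {
      value = clusterSetting;         // otherwise fall back to the setting
    } else {
      value = BUILT_IN_DEFAULT;       // otherwise use the documented default
    }
    if (value < MINIMUM) {
      // Assumption: reject rather than silently clamp.
      throw new IllegalArgumentException(
          "buffer_limit must be at least " + MINIMUM + " but was " + value);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, null));       // prints 100000
    System.out.println(resolve(null, 200_000));    // prints 200000
    System.out.println(resolve(60_000, 200_000));  // prints 60000
  }
}
```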
* new_field: Alias of the output pattern field. (default: "patterns_field").
* algorithm parameters: optional. Algorithm-specific tuning:
- ``simple_pattern`` : Define regex via "pattern".
- ``brain`` : Adjust sensitivity with variable_count_threshold (int > 0) and frequency_threshold_percentage (double 0.0 - 1.0).
There was a problem hiding this comment.
nit: explain what variable_count_threshold and frequency_threshold_percentage are.
import org.opensearch.sql.common.patterns.PatternUtils.ParseResult;

public class LogPatternAggFunction implements UserDefinedAggFunction<LogParserAccumulator> {
private int bufferLimit = 100000;

}

public static class LogParserAccumulator implements Accumulator {
private final List<String> logMessages;
There was a problem hiding this comment.
Is access to logMessages thread-safe?
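The thread does not record whether the accumulator is ever shared between threads. If it were, one simple option would be to guard the list with the object's monitor, as in the sketch below. `SynchronizedLogBuffer` is a hypothetical stand-in written for this note, not the PR's `LogParserAccumulator`, and the flush-on-limit behavior is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an accumulator's message buffer; not the PR's class. */
final class SynchronizedLogBuffer {
  private final int bufferLimit;
  // Only accessed inside synchronized methods, so a plain ArrayList is sufficient.
  private final List<String> logMessages = new ArrayList<>();

  SynchronizedLogBuffer(int bufferLimit) {
    if (bufferLimit <= 0) {
      throw new IllegalArgumentException("bufferLimit must be positive");
    }
    this.bufferLimit = bufferLimit;
  }

  /** Adds a message and reports whether the buffer has reached its limit. */
  synchronized boolean add(String message) {
    logMessages.add(message);
    return logMessages.size() >= bufferLimit;
  }

  /** Returns a copy of the buffered messages and clears the buffer. */
  synchronized List<String> drain() {
    List<String> snapshot = new ArrayList<>(logMessages);
    logMessages.clear();
    return snapshot;
  }

  synchronized int size() {
    return logMessages.size();
  }

  public static void main(String[] args) throws InterruptedException {
    SynchronizedLogBuffer buffer = new SynchronizedLogBuffer(100_000);
    Runnable writer = () -> {
      for (int i = 0; i < 1_000; i++) {
        buffer.add("log " + i);
      }
    };
    Thread first = new Thread(writer);
    Thread second = new Thread(writer);
    first.start();
    second.start();
    first.join();
    second.join();
    System.out.println(buffer.size()); // prints 2000
  }
}
```

If the accumulator is only ever touched by a single thread, the synchronization is unnecessary overhead and the original unsynchronized list is fine.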
* Revert simple_pattern window function change to recover pushdown ability
* Add SIMPLE_PATTERN patterns command support based on parse command
* Address minor comments
* Address comments part 2
* Make allowCast for pattern VARCHAR literal
* Fix spotless
* Minor ut failure fix
* Brain patterns command in Calcite with combined UDF and UDAF
* Revert debug flag
* Minor ut failure fix
* Minor ut failure fix part2
* Pick missing ast Window from main
* Support agg and label mode and new model for patterns command
* Remove unnecessary files and comments
* Use uncollect_patterns table function to flatten patterns list
* Fix partial UT
* Add 3570 yaml tests
* Fix plans in explain ITs
* Fix pushdown ITs failure
* Fix doctest examples for V2 engine results
* Minor fix after rebasing
* Uncomment build.gradle change
* Address minor comment
* Address patterns doc comments and fix conflicts
* Fix doctest
* Reuse expand command plan to replace hacky uncollect_patterns UDTF
* Minor fix after resolving merge conflicts
* Refactor duplicate building expand rel node logic
* Fix the issue of expand command plan executing main query twice
---------
Signed-off-by: Songkan Tang <songkant@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Description
This aims to resolve #3569
The BRAIN pattern method of the Patterns command is implemented by a combined UDF and UDAF.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.