[RFC] Improve patterns command with more advanced log pattern algorithms #3251

@songkant-aws

Description

Problem Statement

OpenSearch PPL currently supports the patterns command, which extracts log patterns by deriving a patterns_field from a selected log message field. By default, it generates the log pattern field with a simple approach: it removes specified characters and treats the remaining characters as the pattern. Users can query the top N log patterns by grouping log messages on the newly derived log pattern field, like this:
source=t | patterns message | stats count() as count, take(message, 100) by patterns_field | sort - count
We will call this current approach the simple log pattern throughout this doc. The problem with this approach is that its grouping accuracy is relatively low for most industrial software logs. To achieve relatively high grouping accuracy, an ops engineer needs specific domain knowledge and must manually pass a regex pattern to the command. Today's industrial log statements evolve quickly, so manually crafting log patterns becomes a challenging task. We therefore want to introduce more advanced log pattern algorithms that automatically extract log patterns with high grouping accuracy.

Current State

Though the patterns command is simple, it has the following downsides:

  1. The default generated pattern is usually a combination of punctuation characters left over after removing alphanumeric characters, which is not human-readable. E.g. the original message 117.251.123.89 - - [2018-07-30T09:46:45.214Z] \"GET /apm HTTP/1.1\" 200 8002 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\" yields the pattern ... - - [--::.] \" / /.\" \"-\" \"/. (; _; :.) / /.\"
  2. It doesn't achieve high grouping accuracy when multiple different log statements share the same punctuation pattern, and text tokens dominate industrial software log messages. E.g. the three different log messages Server started, waiting., Server stopped, waiting., and Changing view, hdfs. will all be treated as sharing the same pattern , ..
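The collapsing behavior in downside 2 is easy to reproduce. Below is a minimal, illustrative sketch of the simple approach (strip alphanumeric characters, keep the punctuation); the actual PatternsExpression implementation differs in detail, but the grouping problem is the same:

```java
// Minimal sketch of the simple log pattern approach: strip alphanumeric
// characters and keep the remaining punctuation as the pattern.
// Illustrative only; not the actual PatternsExpression implementation.
public class SimplePattern {
    static String pattern(String message) {
        return message.replaceAll("[a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        // Three distinct messages collapse to the same punctuation pattern.
        System.out.println(pattern("Server started, waiting."));
        System.out.println(pattern("Server stopped, waiting."));
        System.out.println(pattern("Changing view, hdfs."));
    }
}
```

All three messages print the identical pattern, so a stats ... by patterns_field query would count them as one group.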

Long-Term Goals

  • Improve the patterns command with automatic log pattern extraction algorithms (or with minimal human intervention)
  • Improve log pattern grouping accuracy
  • Scale log pattern extraction across millions of log messages

Proposal

  1. Empower the patterns command with the advanced log pattern algorithms mentioned in the previously proposed OpenSearch RFC.
  2. Refactor the patterns command into a standalone abstraction (currently it's an implementation of ParseExpression), considering that there are different methods of log pattern extraction.
  3. Push down complex log pattern extraction code to OpenSearch (probably via painless script) to achieve distributed log pattern extraction. (Probably out of scope: we could invest in a code generation module in the sql plugin that translates the physical plan's Volcano model into a single function fusing multiple operators together.)

Approach

As proposed, we will make the patterns command a standalone abstraction. The current simple log pattern fuses a PatternsExpression into the Project operator. The PatternsExpression is bound to the target log message field and applies a regex to it, emitting the parsed result as a newly derived pattern field in the Project operator.

The simple log pattern's computing paradigm fits very well into the Volcano model's iterator pattern, i.e. processing results from the upstream operator row by row. However, the log pattern algorithm called Brain, proposed in the previous RFC, conflicts with it. The Brain algorithm first iterates over all input messages to compute something like a token histogram, and only then denotes variable tokens row by row based on that histogram. So it's hard to extend the current ParseExpression to support it.
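To see why the two-pass paradigm cannot run row by row, here is a much-simplified sketch of a histogram-based extraction: pass 1 counts token frequencies across all messages, pass 2 marks infrequent tokens as variables. The real Brain algorithm is considerably more involved; this only illustrates the data dependency on the full input:

```java
import java.util.*;

// Simplified two-pass sketch of the Brain-style OFFLINE paradigm.
// Pass 1 builds a token frequency histogram over all messages; pass 2
// replaces tokens that do not appear in every message with a wildcard.
// Illustrative only; the actual Brain algorithm is more sophisticated.
public class TwoPassSketch {
    static List<String> extract(List<String> messages) {
        Map<String, Integer> histogram = new HashMap<>();
        for (String m : messages) {                       // pass 1: full scan
            for (String token : m.split("\\s+")) {
                histogram.merge(token, 1, Integer::sum);
            }
        }
        int n = messages.size();
        List<String> patterns = new ArrayList<>();
        for (String m : messages) {                       // pass 2: row by row
            StringBuilder sb = new StringBuilder();
            for (String token : m.split("\\s+")) {
                // Tokens present in every message are constants; others are variables.
                sb.append(histogram.get(token) == n ? token : "<*>").append(' ');
            }
            patterns.add(sb.toString().trim());
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "Connected to 10.0.0.1",
            "Connected to 10.0.0.2");
        System.out.println(extract(logs)); // both rows map to "Connected to <*>"
    }
}
```

Because pass 2 cannot start until pass 1 has seen every row, no per-row expression evaluated inside Project can express this computation.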

Luckily, we observed that log pattern algorithms can be classified into two types: ONLINE and OFFLINE. ONLINE algorithms are streaming algorithms; the simple log pattern and other streaming approaches fall into this category. OFFLINE algorithms have to iterate over all log messages first and then figure out the log patterns in one or more subsequent passes. The Window operator happens to satisfy both computing paradigms. For ONLINE log pattern algorithms, we can create OpenSearch-specialized streaming window functions, just like SQL's ROW_NUMBER(), RANK(), etc. For OFFLINE log pattern algorithms, we can create OpenSearch-specialized buffering window functions, just like SQL's SUM(), PERCENT_RANK(), etc. The difference is that a log pattern buffering window function has an n-to-n relationship (one output pattern per input row) rather than an aggregate's n-to-1 relationship.
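The two flavors can be captured with a pair of hypothetical interfaces (names are illustrative, not the actual plugin API): a streaming function maps one row to one pattern immediately, while a buffering function maps the whole buffered partition to one pattern per row, which is the n-to-n shape described above:

```java
import java.util.List;

// Hypothetical interfaces sketching the two window-function flavors this
// proposal distinguishes. Names are illustrative, not the actual plugin API.
public class WindowFunctionSketch {

    // ONLINE: one row in, one pattern out, immediately (non-blocking).
    interface StreamingPatternFunction {
        String apply(String message);
    }

    // OFFLINE: all buffered rows in, one pattern per row out
    // (n-to-n, unlike an aggregate's n-to-1).
    interface BufferingPatternFunction {
        List<String> apply(List<String> messages);
    }

    // The simple log pattern expressed as a streaming function.
    static final StreamingPatternFunction SIMPLE =
        m -> m.replaceAll("[a-zA-Z0-9]", "");

    public static void main(String[] args) {
        System.out.println(SIMPLE.apply("Server started, waiting."));
    }
}
```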

First Stage

In the first stage, we want to focus on grouping log patterns on the OpenSearch coordinator node. I will use a simple textual plan representation with explanations to elaborate the implementation draft.

For the following simple PPL query:
source=t | patterns message

Current Simple Log Pattern Plan

The current final physical plan is like:

ProjectOperator[message#1, PatternsExpression(message#1) as patterns_field#2]
+- OpenSearchIndexScan

The above plan means patterns_field is a new field holding the evaluation result of PatternsExpression(message#1).

New Simple Log Pattern Plan

After introducing the new pattern window functions, the query could be (for now, I reuse SQL syntax to represent a window operator in PPL):
source=t | patterns(SIMPLE, message#1, other algorithm specific arguments...) over()

The final plan for the original simple log pattern will change to:

Project[message#1, patterns_field#2]
+- Window[simple(message#1), windowsSpec(partitionBy=null, frame=CurrentRowWindowFrame)]
   +- OpenSearchIndexScan

The above plan means that while analyzing the unresolved plan of the patterns command, if the algorithm falls into the ONLINE category, we assign CurrentRowWindowFrame to the window specification. CurrentRowWindowFrame only stores the current and previous values of the iterator, which is enough for the row-by-row iterating paradigm. This operator can be fused with other operators in a single pipeline, meaning it's non-blocking.
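The non-blocking behavior can be sketched as an iterator that pulls exactly one row from upstream per call and derives the pattern on the spot, buffering nothing. Class and method names here are illustrative, not the actual plugin classes:

```java
import java.util.*;

// Hypothetical sketch of CurrentRowWindowFrame-style streaming evaluation:
// each next() pulls one row from upstream and immediately computes the
// derived pattern field, so the operator pipelines with its neighbors.
// Names are illustrative, not the actual plugin classes.
public class StreamingFrameSketch {
    static Iterator<String> patternsOver(Iterator<String> upstream) {
        return new Iterator<String>() {
            public boolean hasNext() { return upstream.hasNext(); }
            public String next() {
                // One row in, one pattern out: nothing is buffered.
                return upstream.next().replaceAll("[a-zA-Z0-9]", "");
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> out =
            patternsOver(List.of("Server started, waiting.").iterator());
        System.out.println(out.next());
    }
}
```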

New Brain Log Pattern Plan

The final plan for OFFLINE algorithms, such as the Brain log pattern algorithm, will look like:

Project[message#1, patterns_field#2]
+- Window[brain(message#1), windowsSpec(partitionBy=null, frame=PeerRowsWindowFrame)]
   +- OpenSearchIndexScan

The above plan means that while analyzing the unresolved plan of the patterns command, if the algorithm falls into the OFFLINE category, we assign PeerRowsWindowFrame to the window specification. PeerRowsWindowFrame buffers each row value into its peers (a List). When next() is called, the algorithm computes histograms before continuing iteration. This operator is blocking, just as aggregation is.
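The blocking behavior can be sketched as a frame that drains all peer rows into a buffer, hands the whole buffer to an OFFLINE algorithm once, and then replays the n results row by row (the n-to-n shape). Names are illustrative, not the actual plugin classes:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of PeerRowsWindowFrame-style blocking evaluation:
// buffer every peer row first, run the OFFLINE algorithm over the whole
// buffer once, then emit one result per buffered row.
// Names are illustrative, not the actual plugin classes.
public class BufferingFrameSketch {
    static Iterator<String> patternsOver(
            Iterator<String> upstream,
            Function<List<String>, List<String>> offlineAlgo) {
        List<String> peers = new ArrayList<>();
        upstream.forEachRemaining(peers::add);        // blocking: buffer everything
        return offlineAlgo.apply(peers).iterator();   // one result per input row
    }

    public static void main(String[] args) {
        // Toy "algorithm" standing in for Brain: uppercase each buffered row.
        Iterator<String> out = patternsOver(
            List.of("a", "b").iterator(),
            rows -> rows.stream().map(String::toUpperCase)
                        .collect(Collectors.toList()));
        System.out.println(out.next()); // "A"
    }
}
```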

Second Stage

The goal of the second stage is to scale log pattern extraction by pushing down its execution to OpenSearch data nodes. This needs further research and testing, so this part is left TBD.

Alternative

New operator

Instead of extending the existing WindowOperator, we could create a brand-new operator called LogPatternOperator to accommodate the implementations above. But it would end up with an abstraction very similar to window functions.

Implementation Discussion

  • If we adopt the window function abstraction, do we want to support partition by and sort by for the patterns command? I can see it being useful, but we haven't encountered such requirements before.
  • Different log pattern algorithms have different grouping accuracy on different logs. Do we want to allow users to override complex input parameters per algorithm?
  • If we want to support distributed log pattern extraction, OpenSearch doesn't have a shuffle exchange, so how do we perform the reduce operation on log pattern results coming from data nodes?

Labels: enhancement (New feature or request)