[RFC] Security Analytics Correlation Engine

## Introduction

Security Analytics is an open-source solution for security operations in OpenSearch. Its threat detection engine converts detection rules into executable OpenSearch queries, which are then matched against the logs or events ingested by the user to generate `findings`. Trigger condition filters are then applied to the findings to generate `alerts`.

## Problem Statement

Today in Security Analytics, the generated `findings` & `alerts` belong to `individual log types & there is no way to automatically correlate between them`.
As customers' data spans multiple security event logs (S3 access, VPC flow, syslog, DNS), a finding on a single log source is not enough to raise confidence in that finding. A strong correlation across logs also helps customers dive into the relationships between data from different sources.
To understand this correlation across findings from different log sources today, customers have to browse through the list of findings generated for individual log categories & then `identify the correlated patterns manually`.

Here is an example.

#### Example Infrastructure

In the sample customer infrastructure diagram shown below, the customer has a `Django REST application` hosted on an `EC2 Windows instance`. The `REST APIs` use `Active Directory` as the identity provider & the `Django application` uses `S3` to store & query files. The `incoming network traffic` logs for the `EC2 Windows instance` are also stored as `VPC Flow Logs`.

![image](https://user-images.githubusercontent.com/617607/224168111-aaf1ad10-48cb-467b-b91d-dd9580987c10.png)

#### Security Analytics detectors generate findings for a threat

For `Security Analytics` to monitor & detect threats for the above `customer infrastructure`, we need to define a `Security Analytics Threat Detector` for each component in the infrastructure. For example, the diagram below shows that a `Network Detector` is defined for `VPC Flow logs`, a `Windows Detector` is defined for the `EC2 Windows instance` & so on.

![image (1)](https://user-images.githubusercontent.com/617607/224168537-569c0f3f-dd62-4d12-9c9d-a7c410fccb16.png)

Now, let's simulate a `security attack` on this infrastructure. In this attack, the attacker uses the `sbcd90` user to call a REST API named `POST /customer_records.txt`, which tries to `replicate a sensitive file named customer_records.txt` from `S3`.

The diagram above shows that if such an attack happens, each `Threat Detector` monitoring its corresponding infrastructure component generates a `finding`. For example, the `AD/Ldap Detector` generates an `Invalid Username/Password, ResultType: 50126` finding & so on.

These `findings`, as shown in the diagram, are generated by `individual detectors` & belong to their `respective log types`. But how does the customer know that the `AD/Ldap Detector` finding of `Invalid Username/Password, ResultType: 50126` is related to a chain of `security events` occurring around the `same time window` on the infrastructure (say, the `403 Forbidden error` finding from the `Applications detector`)?

One way is to `manually correlate this finding with the list of findings belonging to other log types` within a `particular time range`. Can we solve this problem automatically?

## Proposed Solution

The `Security Analytics Correlation Engine` addresses this issue by allowing customers to `define, exactly once, the different threat scenarios that can be identified from the logs generated by the individual systems in their infrastructure` & then `generating correlations between findings from different log categories automatically`.

The `Correlation Engine` is a `Security Finding Knowledge Graph` which stores connected findings data & generates `correlated insights (as well as correlated historical insights)` from it based on `time windows`.

## Correlation Engine Feature Scope

Customers can define the most relevant `threat scenarios` between logs of different systems in their infrastructure as `correlation rules` using simple `sql-like` queries. Here is an example.

If we want to define a `threat scenario` that identifies `403 Forbidden error` findings generated by the `application detector` on a set of `windows hosts` in the IP range `4.5.6.*`, we can define it as follows:

```
"field": {
  "windows": "host:4.5.6.*",
  "application": "status:403"
},
"query": true
```
Thus, this `threat scenario` connects `application logs` with `windows logs`. Similarly, customers can define several `threat scenarios` for different systems in their infrastructure `based on their requirements`.
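
As a rough illustration of how the per-log-type conditions in such a threat scenario could be evaluated, the Python sketch below checks whether a log record satisfies the `field:pattern` condition for its log type. The helper names (`parse_condition`, `matches`) and the flat dictionary record shape are assumptions made for this example, not the plugin's actual query format or implementation. Wildcards are handled with shell-style globbing from the standard library.

```python
from fnmatch import fnmatchcase

# Threat scenario from the example above: log type -> "field:pattern" condition.
THREAT_SCENARIO = {
    "windows": "host:4.5.6.*",
    "application": "status:403",
}


def parse_condition(condition):
    """Split a "field:pattern" string into its field name and pattern."""
    field, _, pattern = condition.partition(":")
    return field, pattern


def matches(log_type, record, scenario=THREAT_SCENARIO):
    """Return True if `record` (a flat dict, hypothetical shape) satisfies the
    scenario's condition for `log_type`. Unknown log types never match."""
    condition = scenario.get(log_type)
    if condition is None:
        return False
    field, pattern = parse_condition(condition)
    if field not in record:
        return False
    # Compare as strings so numeric fields such as status codes also work.
    return fnmatchcase(str(record[field]), pattern)
```

With this sketch, a `windows` record with `host` set to `4.5.6.7` and an `application` record with `status` set to `403` would both satisfy the scenario, which is the kind of pairing the rule is meant to connect.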

These `threat scenarios` are then used by the `Correlation Engine` to define a `graph of correlated findings`. 

The `Correlation Engine` can then return the findings near a particular finding, thus correlating `findings, logs & rules across log categories`.
Here is an example `correlation` generated for `the example infrastructure shown in the diagram above`.

```
GET /_plugins/_security_analytics/findings/correlate?finding=05e75ff0-4ae9-44bd-805f-893559e9fa62&detector_type=windows&time_window=120000&nearby_findings=20

{
    "findings": [
        {
            "finding": "8bf20320-a2bc-433a-a1a4-5fda16ed6875",
            "detector_type": "ad_ldap",
            "score": 1.7824930864662747E-6
        },
        {
            "finding": "52a024ba-c423-42e5-b97c-1781a875940c",
            "detector_type": "s3",
            "score": 1.6266511011053808E-5
        },
        {
            "finding": "30cc64a7-13dd-4ec4-a2bd-737ed3c80578",
            "detector_type": "others_application",
            "score": 1.6309222701238468E-5
        },
        {
            "finding": "4f20bb77-ac05-4d74-87b8-16386292d89f",
            "detector_type": "network",
            "score": 8.688701200298965E-6
        },
        {
            "finding": "e1a40ae5-70aa-4b28-a02c-9b59074499b8",
            "detector_type": "ad_ldap",
            "score": 8.688701200298965E-6
        },
        {
            "finding": "8a1678a0-8342-4734-b6ea-17dfcda9174e",
            "detector_type": "windows",
            "score": 8.07421838544542E-6
        },
        {
            "finding": "41c6a383-d0e3-4f32-b83e-ca6d927c2067",
            "detector_type": "network",
            "score": 1.7824930864662747E-6
        }
    ]
}
```
The `scores` indicate the `proximity` of each `relevant (identified from the threat scenarios defined by the customer)` finding to the `windows` finding `05e75ff0-4ae9-44bd-805f-893559e9fa62` given in the query, within the time window of `2 minutes` (`time_window=120000`).
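
A minimal Python sketch of working with this endpoint follows. It builds the request path from the parameters shown above and groups the returned findings by `detector_type`. The function names are hypothetical, and treating `time_window` as milliseconds is an inference from `120000` corresponding to 2 minutes in the example. No HTTP call is made, so the sketch does not depend on any client library.

```python
from collections import defaultdict
from urllib.parse import urlencode

CORRELATE_PATH = "/_plugins/_security_analytics/findings/correlate"


def build_correlate_path(finding_id, detector_type, window_minutes, nearby_findings):
    """Build the request path and query string.

    Assumption: time_window is expressed in milliseconds.
    """
    params = {
        "finding": finding_id,
        "detector_type": detector_type,
        "time_window": int(window_minutes * 60 * 1000),
        "nearby_findings": nearby_findings,
    }
    return f"{CORRELATE_PATH}?{urlencode(params)}"


def group_by_detector_type(response):
    """Map each detector_type to the list of correlated finding ids."""
    grouped = defaultdict(list)
    for item in response.get("findings", []):
        grouped[item["detector_type"]].append(item["finding"])
    return dict(grouped)
```

Grouping by `detector_type` makes it easy to see which log categories contributed correlated findings for the queried finding.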

## Building Blocks
The `Detectors` in `Security Analytics` internally create `Monitors` in `Alerting`, which run `periodic jobs` against the logs generated by each component in the customer infrastructure. When these logs match the rules, `findings` are generated in `Alerting`.

Once a `finding` is generated, an `asynchronous (fire & forget)` transport layer call is made to the `Correlation Engine` to correlate this `new finding` with `existing findings`. This `new finding & its correlations` are then stored in the `HNSW Graph (or Vector storage)`.

![image (2)](https://user-images.githubusercontent.com/617607/224172022-e85851d3-3b63-4723-a2c0-e939a124e9f5.png)
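
The fire & forget hand-off described above can be sketched in Python as follows. This is only an illustration of the calling pattern under stated assumptions: `correlate`, `on_finding_generated` and the in-memory `correlation_store` are hypothetical stand-ins, and a thread pool is used here in place of the transport layer call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the correlation store.
correlation_store = []

# Background workers that run correlation off the caller's thread.
_executor = ThreadPoolExecutor(max_workers=2)


def correlate(finding):
    """Placeholder for the correlation step; here it only records the finding id."""
    correlation_store.append(finding["id"])


def on_finding_generated(finding):
    """Hand the finding to the correlation step and return immediately.

    The caller is not expected to wait on the returned future (fire & forget);
    it is returned only so that tests can synchronise on completion.
    """
    return _executor.submit(correlate, finding)
```

The point of the pattern is that finding generation is not blocked by correlation work, so a slow or failing correlation step does not delay the detector.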

## Correlation Engine internals

The `internals` of the `Correlation Engine` consist of `4 major components`.

* *HNSW Graph based vector storage* - the storage layer, based on an HNSW graph, that stores all finding vectors & supports querying them at the `vector` level.
* *Insertion Algorithm* - the most important piece of the `Correlation Engine`. In this layer, `findings` are converted to `k-dimensional vectors` & stored in the `vector storage` layer mentioned above along with their `correlations`.
* *Search Algorithm* - the second most important piece of the `Correlation Engine`. It lets the user specify a particular finding, converts it to a `k-dimensional vector` & uses that vector to query its `neighboring findings`, which are its `correlated findings` within a `time window`.
* *Join Engine* - determines the `immediate neighbors` of a particular finding, `given the correlation metadata between the Threat Detector that generated the finding & the log categories`.

![image (3)](https://user-images.githubusercontent.com/617607/224173035-6dfcf5f0-9828-4d14-a24c-7469c7f78800.png)
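
To make the storage, insertion and search components more concrete, here is a small Python sketch under loudly stated assumptions. The RFC does not specify how findings are converted to vectors, so `to_vector` uses a purely illustrative encoding (a one-hot log-type part plus a scaled timestamp). `FindingStore` performs a brute-force nearest-neighbor scan as a stand-in for the HNSW graph, and the Join Engine is not modeled. All names are hypothetical.

```python
import math

# Hypothetical fixed ordering of log types; one vector axis per log type.
LOG_TYPES = ["windows", "ad_ldap", "s3", "network", "others_application"]


def to_vector(log_type, timestamp_ms, time_scale_ms=60_000):
    """Turn a finding into a small vector: a one-hot log-type part plus a
    scaled timestamp as the last component (illustrative encoding only)."""
    vec = [0.0] * (len(LOG_TYPES) + 1)
    vec[LOG_TYPES.index(log_type)] = 1.0
    vec[-1] = timestamp_ms / time_scale_ms
    return vec


class FindingStore:
    """Brute-force stand-in for the HNSW graph based vector storage."""

    def __init__(self):
        self._rows = []  # (finding_id, timestamp_ms, vector)

    def insert(self, finding_id, log_type, timestamp_ms):
        """Insertion step: convert the finding to a vector and store it."""
        self._rows.append((finding_id, timestamp_ms, to_vector(log_type, timestamp_ms)))

    def search(self, log_type, timestamp_ms, time_window_ms, k):
        """Search step: return up to k (finding_id, distance) pairs whose
        timestamps fall inside the time window, nearest first."""
        query = to_vector(log_type, timestamp_ms)
        hits = []
        for finding_id, ts, vec in self._rows:
            if abs(ts - timestamp_ms) > time_window_ms:
                continue  # outside the requested time window
            hits.append((finding_id, math.dist(query, vec)))
        hits.sort(key=lambda pair: pair[1])
        return hits[:k]
```

A real HNSW index answers the same kind of query approximately and in sub-linear time; the linear scan here is chosen only because it is easy to read and verify.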





