
Querying Elasticsearch might be very memory intensive #630

@regulartim

Description

Following up on #624, the new way of extracting hits from Elasticsearch might be too memory intensive.

https://github.com/intelowlproject/GreedyBear/blob/d5a9906da5cd3ebf293ffedc77f466400cf0b1be/greedybear/cronjobs/repositories/elastic.py#L74-L79

In line 74, the list constructor is called on search.scan(). The scan method returns an iterator over all hits from the requested time span. But because we want to cache the result, we have to create a list from the iterator, which means that all hits will be held in system memory at once. This might become an issue when
a) the time span is very long (e.g. 3 days on the initial extraction run) and
b) the T-Pot instance records a high number of attacks.
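One possible mitigation would be to consume the scan() iterator in bounded chunks rather than materializing the full list. A minimal sketch below: iter_chunks is a hypothetical helper (not part of the GreedyBear code), and the generator expression stands in for search.scan(); the chunk size is an illustrative knob for trading memory against per-chunk overhead.

```python
from itertools import islice

def iter_chunks(hits, chunk_size=10_000):
    """Yield hits in fixed-size lists so only one chunk is resident in memory at a time."""
    it = iter(hits)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

# Simulated hit stream standing in for search.scan();
# ~25,000 items mirrors one day on a small T-Pot instance.
hits = (f"hit-{i}" for i in range(25_000))

# Process chunk by chunk: peak memory is bounded by chunk_size,
# not by the total number of hits in the time span.
total = sum(len(chunk) for chunk in iter_chunks(hits, chunk_size=1_000))
```

Whether chunking is compatible with the caching discussed above is the open question: a chunked approach would need the cache consumers to accept an iterable instead of a list.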

Since I am currently only running a T-Pot instance with very few active honeypots on a residential internet connection, I only see ~25,000 honeypot hits a day, which is not a problem. It would be nice if someone could test this on a GreedyBear instance that gets attacked more frequently.

Currently this code only exists on the develop branch; before it gets merged, we should make sure the memory footprint is acceptable.


Labels

enhancement (New feature or request), python (Pull requests that update Python code)
