
Querying Elasticsearch might be very memory intensive #630

@regulartim

Description

Following up on #624, the new way of extracting hits from Elasticsearch might be too memory intensive.

https://github.com/intelowlproject/GreedyBear/blob/d5a9906da5cd3ebf293ffedc77f466400cf0b1be/greedybear/cronjobs/repositories/elastic.py#L74-L79

In line 74, the list constructor is called on search.scan(). The scan method returns an iterator over all hits from the requested time span. But because we want to cache the result, we have to create a list from the iterator, which means that all hits will be held in system memory at once. This might become an issue when
a) the time span is very long (e.g. 3 days on the initial extraction run) and
b) the T-Pot instance records a high number of attacks.
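One possible mitigation would be to consume the scan() iterator in bounded chunks rather than materializing the full list. A minimal sketch below: iter_chunks is a hypothetical helper (not part of the GreedyBear code), and the generator expression stands in for search.scan(); the chunk size is an illustrative knob for trading memory against per-chunk overhead.

```python
from itertools import islice

def iter_chunks(hits, chunk_size=10_000):
    """Yield hits in fixed-size lists so only one chunk is resident in memory at a time."""
    it = iter(hits)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

# Simulated hit stream standing in for search.scan();
# ~25,000 items mirrors one day on a small T-Pot instance.
hits = (f"hit-{i}" for i in range(25_000))

# Process chunk by chunk: peak memory is bounded by chunk_size,
# not by the total number of hits in the time span.
total = sum(len(chunk) for chunk in iter_chunks(hits, chunk_size=1_000))
```

Whether chunking is compatible with the caching discussed above is the open question: a chunked approach would need the cache consumers to accept an iterable instead of a list.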

Since I am currently only running a T-Pot instance with very few active honeypots on a residential internet connection, I only see ~25,000 honeypot hits a day, which is not a problem. It would be nice if someone could test this on a GreedyBear instance that gets attacked more frequently.

Currently this code only exists on the develop branch; before it gets merged, we should make sure the memory footprint is acceptable.


Labels

enhancement (New feature or request), python (Pull requests that update Python code)
