Improve extraction speed. Closes #422. by regulartim · Pull Request #423 · GreedyBear-Project/GreedyBear

regulartim · 2025-01-09T18:01:00Z

Description

Changes made:

added a DB index on IOCs name field, primarily to speed up the existence check, when new IOC data is added in _add_ioc()
used QuerySet.exists() in _check_first_time_run() as mentioned in IOC extraction process slow #422
for cowrie I changed that the ioc_record obtained previously in _add_ioc() will now be also used for adding cowrie sessions
I rewrote the time window calculation, put it in a separate function and wrote tests for it. The main idea behind this change is to get it to reliably calculate the intended window even if the triggering cronjob takes more than a minute to run.

Related issues

IOC extraction process slow #422

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue).

Checklist

I have read and understood the rules about how to Contribute to this project.
The pull request is for the branch develop.
I have added documentation of the new features.
Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
If the GUI has been modified:
- I have a provided a screenshot of the result in the PR.
- I have created new frontend tests for the new component or updated existing ones.

Important Rules

If you miss to compile the Checklist properly, your PR won't be reviewed by the maintainers.
If your changes decrease the overall tests coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem
Everytime you make changes to the PR and you think the work is done, you should explicitly ask for a review. After being reviewed and received a "change request", you should explicitly ask for a review again once you have made the requested changes.

…ce number of DB queries

drosetti · 2025-01-14T17:35:43Z

            self.log.info("Finished execution")
+
+
+def get_time_window(reference_time, window_minutes: int, additional_lookback: int = 0) -> tuple:


I thinks it's redundant a double parameter for the time range. I mean the fact that it's a param it allows it to customize it. why we should pass two parameters to do the same thing ?

The parameters have different semantic meanings:

window_minutes defines the basic size of the time window and influences the rounding/orientation of the window

additional_lookback only extends the start time backwards without changing the basic structure of the window

Consider this case: In our default configuration we extract every 10 minutes but look back 3 days on the first run. If the first run happens at 14:21 on January 5th, we want the time window to be 14:20 (January 2nd) - 14:20 (January 5th). But if window_minutes = 3*24*60, we would get a window of 0:00 (January 1st) - 0:00 (January 3rd) as this is the last closed time window of this size. The next run at 14:30 would then extract 14:20 (January 5th) - 14:30 (January 5th) so we would miss data between 0:00 (January 3rd) and 14:20 (January 5th).

Does that make sense?

As you said is the "default configuration" it is supposed to be changed if needed.

Also I don't get this part:

The next run at 14:30 would then extract 14:20 (January 5th) - 14:30 (January 5th) so we would miss data between 0:00 (January 3rd) and 14:20 (January 5th).

Why ? Data have been extracted in the previous run 14:20 (January 2nd) - 14:20 (January 5th).

Correct me if I'm wrong or If i'm missing something

Why ? Data have been extracted in the previous run 14:20 (January 2nd) - 14:20 (January 5th).

No, in my example, the first run happens at 14:21 on January 5th. But it seems like my example was not the best. I'll try to explain it differently:

As I wrote, window_minutes defines the basic size of the time window. We need this parameter in case a job starts late. Let's consider window_minutes = 10. Our 14:30 job starts at 14:31 for whatever reason. Because of window_minutes, we extract the correct time interval (14:20-14:30). If we would only have a simple lookback parameter, we would extract a wrong interval (14:21-14:31), which may lead to duplicates and missed data.

The additional_lookback extends the start time backwards. We need this parameter in case we want to deviate from our normal extraction interval. This happens on the first run, when we extract the last 3 days, because of this method in attacks.py:

def minutes_back_to_lookup(self): # overwrite base if self.minutes_back: minutes = self.minutes_back elif self.first_time_run: minutes = 60 * 24 * 3 # 3 days else: minutes = 11 if LEGACY_EXTRACTION else EXTRACTION_INTERVAL return minute

If we would not have this deviation, we also would not need the additional_lookback parameter.

Now to back to my example: consider the first run happens at 14:21 on January 5th. If we would not have the additional_lookback parameter we would have to set window_minutes = 60 * 24 * 3. in this case we would get a wrong interval of 0:00 (on January 1st) - 0:00 (on January 3rd). But what we actually want is 14:20 (January 2nd) - 14:20 (January 5th). In my opinion we can only achieve this using both parameters. Does that make it clearer for you?

I see two point here:

1 - why we need an additional parameter with the same beaviour (always a time shift)? Just change the value of the parameter, it's what parameter are meant to be: You have a parameter with 10 and a parameter with x - 10, just put x in one parameter, I think it's easier. Also the fact that you have to do the subtraction means they are strictly related.
2 - I don't think this solve the problem of the delay in the execution. If the cron execute at 14:31 you have to take 14:20-14:30, but this is not resolved with this solution because you will always have 3 days, all delayed of 1 minute. If we really want to solve this problem we have to save in the db when we stopped in the previous run and in this run take all the time.

Let me know if I explained clearer 😄

Yes, I think I got it now! Thanks for your patience and your effort! :) I'll think about it and push a change tomorrow.

No problem, glad to be helpful 😄

Don't worry about the time you don't have a deadline. It will be a problem in case a contributor proposes himself to solve an issue and then forget about it, but you are working frequently, this is not your case!

I'm here in case you have other questions, also about the development

I'm here in case you have other questions, also about the development

Thanks a lot! :)

I changed removed said argument from the function, please take a look. If the function would be a method of the Cronjob class, one could technically also remove the other two arguments, but that would make testing more difficult, so I decided against it.

2 - I don't think this solve the problem of the delay in the execution. If the cron execute at 14:31 you have to take 14:20-14:30, but this is not resolved with this solution because you will always have 3 days, all delayed of 1 minute. If we really want to solve this problem we have to save in the db when we stopped in the previous run and in this run take all the time.

I'm not sure if you're right here. Would it be possible for you to write a test case for the behaviour you are describing?

drosetti

Very good! I just noticed one thing

… and adapt tests accordingly

regulartim added 8 commits January 9, 2025 18:53

create index on name field of IOC model to speed up _add_ioc function

5889bfc

use QuerySet.exist() for better performance

2b0e113

hand over previously added IOC record to _get_sessions method to redu…

f320d4f

…ce number of DB queries

fix returning wrong IOC object

624672e

add more error-resistant time window calculation

e447921

document additional_lookback argument

b69e95b

minor improvements to get_time_window function

bc619e2

add test cases for get_time_window function

96152bb

regulartim marked this pull request as ready for review January 10, 2025 07:42

regulartim requested a review from drosetti January 10, 2025 07:42

fix error in docstring

c9f113f

drosetti reviewed Jan 14, 2025

View reviewed changes

drosetti suggested changes Jan 14, 2025

View reviewed changes

remove argument from function that is already a configuration setting…

63d77c7

… and adapt tests accordingly

regulartim requested a review from drosetti January 17, 2025 07:07

drosetti approved these changes Jan 20, 2025

View reviewed changes

drosetti merged commit c672105 into GreedyBear-Project:develop Jan 20, 2025

regulartim deleted the improve_extraction_speed branch February 21, 2025 14:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve extraction speed. Closes #422.#423

Improve extraction speed. Closes #422.#423
drosetti merged 10 commits intoGreedyBear-Project:developfrom
regulartim:improve_extraction_speed

regulartim commented Jan 9, 2025 •

edited

Loading

Uh oh!

drosetti Jan 14, 2025

Uh oh!

regulartim Jan 14, 2025

Uh oh!

drosetti Jan 15, 2025 •

edited

Loading

Uh oh!

regulartim Jan 15, 2025

Uh oh!

drosetti Jan 16, 2025

Uh oh!

regulartim Jan 16, 2025

Uh oh!

drosetti Jan 16, 2025

Uh oh!

drosetti Jan 16, 2025

Uh oh!

regulartim Jan 17, 2025

Uh oh!

drosetti left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		self.log.info("Finished execution")


		def get_time_window(reference_time, window_minutes: int, additional_lookback: int = 0) -> tuple:

Uh oh!

Conversation

regulartim commented Jan 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Type of change

Checklist

Important Rules

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drosetti Jan 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drosetti left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

regulartim commented Jan 9, 2025 •

edited

Loading

drosetti Jan 15, 2025 •

edited

Loading