Skip to content

Improve extraction speed. Closes #422.#423

Merged
drosetti merged 10 commits intoGreedyBear-Project:developfrom
regulartim:improve_extraction_speed
Jan 20, 2025
Merged

Improve extraction speed. Closes #422.#423
drosetti merged 10 commits intoGreedyBear-Project:developfrom
regulartim:improve_extraction_speed

Conversation

@regulartim
Copy link
Copy Markdown
Member

@regulartim regulartim commented Jan 9, 2025

Description

Changes made:

  1. added a DB index on IOCs name field, primarily to speed up the existence check, when new IOC data is added in _add_ioc()
  2. used QuerySet.exists() in _check_first_time_run() as mentioned in IOC extraction process slow #422
  3. for cowrie I changed that the ioc_record obtained previously in _add_ioc() will now be also used for adding cowrie sessions
  4. I rewrote the time window calculation, put it in a separate function and wrote tests for it. The main idea behind this change is to get it to reliably calculate the intended window even if the triggering cronjob takes more than a minute to run.

Related issues

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue).

Checklist

  • I have read and understood the rules about how to Contribute to this project.
  • The pull request is for the branch develop.
  • I have added documentation of the new features.
  • Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
  • If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
  • If the GUI has been modified:
    • I have a provided a screenshot of the result in the PR.
    • I have created new frontend tests for the new component or updated existing ones.

Important Rules

  • If you miss to compile the Checklist properly, your PR won't be reviewed by the maintainers.
  • If your changes decrease the overall tests coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem
  • Everytime you make changes to the PR and you think the work is done, you should explicitly ask for a review. After being reviewed and received a "change request", you should explicitly ask for a review again once you have made the requested changes.

@regulartim regulartim marked this pull request as ready for review January 10, 2025 07:42
@regulartim regulartim requested a review from drosetti January 10, 2025 07:42
Comment thread greedybear/cronjobs/base.py Outdated
self.log.info("Finished execution")


def get_time_window(reference_time, window_minutes: int, additional_lookback: int = 0) -> tuple:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinks it's redundant a double parameter for the time range. I mean the fact that it's a param it allows it to customize it. why we should pass two parameters to do the same thing ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters have different semantic meanings:

  • window_minutes defines the basic size of the time window and influences the rounding/orientation of the window
  • additional_lookback only extends the start time backwards without changing the basic structure of the window

Consider this case: In our default configuration we extract every 10 minutes but look back 3 days on the first run. If the first run happens at 14:21 on January 5th, we want the time window to be 14:20 (January 2nd) - 14:20 (January 5th). But if window_minutes = 3*24*60, we would get a window of 0:00 (January 1st) - 0:00 (January 3rd) as this is the last closed time window of this size. The next run at 14:30 would then extract 14:20 (January 5th) - 14:30 (January 5th) so we would miss data between 0:00 (January 3rd) and 14:20 (January 5th).

Does that make sense?

Copy link
Copy Markdown
Contributor

@drosetti drosetti Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said is the "default configuration" it is supposed to be changed if needed.

Also I don't get this part:

The next run at 14:30 would then extract 14:20 (January 5th) - 14:30 (January 5th) so we would miss data between 0:00 (January 3rd) and 14:20 (January 5th).

Why ? Data have been extracted in the previous run 14:20 (January 2nd) - 14:20 (January 5th).

Correct me if I'm wrong or If i'm missing something

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ? Data have been extracted in the previous run 14:20 (January 2nd) - 14:20 (January 5th).

No, in my example, the first run happens at 14:21 on January 5th. But it seems like my example was not the best. I'll try to explain it differently:

As I wrote, window_minutes defines the basic size of the time window. We need this parameter in case a job starts late. Let's consider window_minutes = 10. Our 14:30 job starts at 14:31 for whatever reason. Because of window_minutes, we extract the correct time interval (14:20-14:30). If we would only have a simple lookback parameter, we would extract a wrong interval (14:21-14:31), which may lead to duplicates and missed data.

The additional_lookback extends the start time backwards. We need this parameter in case we want to deviate from our normal extraction interval. This happens on the first run, when we extract the last 3 days, because of this method in attacks.py:

    def minutes_back_to_lookup(self):
        # overwrite base
        if self.minutes_back:
            minutes = self.minutes_back
        elif self.first_time_run:
            minutes = 60 * 24 * 3  # 3 days
        else:
            minutes = 11 if LEGACY_EXTRACTION else EXTRACTION_INTERVAL
        return minute

If we would not have this deviation, we also would not need the additional_lookback parameter.

Now to back to my example: consider the first run happens at 14:21 on January 5th. If we would not have the additional_lookback parameter we would have to set window_minutes = 60 * 24 * 3. in this case we would get a wrong interval of 0:00 (on January 1st) - 0:00 (on January 3rd). But what we actually want is 14:20 (January 2nd) - 14:20 (January 5th). In my opinion we can only achieve this using both parameters. Does that make it clearer for you?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two point here:

1 - why we need an additional parameter with the same beaviour (always a time shift)? Just change the value of the parameter, it's what parameter are meant to be: You have a parameter with 10 and a parameter with x - 10, just put x in one parameter, I think it's easier. Also the fact that you have to do the subtraction means they are strictly related.
2 - I don't think this solve the problem of the delay in the execution. If the cron execute at 14:31 you have to take 14:20-14:30, but this is not resolved with this solution because you will always have 3 days, all delayed of 1 minute. If we really want to solve this problem we have to save in the db when we stopped in the previous run and in this run take all the time.

Let me know if I explained clearer 😄

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I got it now! Thanks for your patience and your effort! :) I'll think about it and push a change tomorrow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, glad to be helpful 😄

Don't worry about the time you don't have a deadline. It will be a problem in case a contributor proposes himself to solve an issue and then forget about it, but you are working frequently, this is not your case!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm here in case you have other questions, also about the development

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm here in case you have other questions, also about the development

Thanks a lot! :)

I changed removed said argument from the function, please take a look. If the function would be a method of the Cronjob class, one could technically also remove the other two arguments, but that would make testing more difficult, so I decided against it.

2 - I don't think this solve the problem of the delay in the execution. If the cron execute at 14:31 you have to take 14:20-14:30, but this is not resolved with this solution because you will always have 3 days, all delayed of 1 minute. If we really want to solve this problem we have to save in the db when we stopped in the previous run and in this run take all the time.

I'm not sure if you're right here. Would it be possible for you to write a test case for the behaviour you are describing?

Copy link
Copy Markdown
Contributor

@drosetti drosetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! I just noticed one thing

@regulartim regulartim requested a review from drosetti January 17, 2025 07:07
@drosetti drosetti merged commit c672105 into GreedyBear-Project:develop Jan 20, 2025
@regulartim regulartim deleted the improve_extraction_speed branch February 21, 2025 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants