Improve extraction speed. Closes #422.#423
Improve extraction speed. Closes #422.#423drosetti merged 10 commits intoGreedyBear-Project:developfrom
Conversation
…ce number of DB queries
| self.log.info("Finished execution") | ||
|
|
||
|
|
||
| def get_time_window(reference_time, window_minutes: int, additional_lookback: int = 0) -> tuple: |
There was a problem hiding this comment.
I thinks it's redundant a double parameter for the time range. I mean the fact that it's a param it allows it to customize it. why we should pass two parameters to do the same thing ?
There was a problem hiding this comment.
The parameters have different semantic meanings:
window_minutesdefines the basic size of the time window and influences the rounding/orientation of the windowadditional_lookbackonly extends the start time backwards without changing the basic structure of the window
Consider this case: In our default configuration we extract every 10 minutes but look back 3 days on the first run. If the first run happens at 14:21 on January 5th, we want the time window to be 14:20 (January 2nd) - 14:20 (January 5th). But if window_minutes = 3*24*60, we would get a window of 0:00 (January 1st) - 0:00 (January 3rd) as this is the last closed time window of this size. The next run at 14:30 would then extract 14:20 (January 5th) - 14:30 (January 5th) so we would miss data between 0:00 (January 3rd) and 14:20 (January 5th).
Does that make sense?
There was a problem hiding this comment.
As you said is the "default configuration" it is supposed to be changed if needed.
Also I don't get this part:
The next run at 14:30 would then extract 14:20 (January 5th) - 14:30 (January 5th) so we would miss data between 0:00 (January 3rd) and 14:20 (January 5th).
Why ? Data have been extracted in the previous run 14:20 (January 2nd) - 14:20 (January 5th).
Correct me if I'm wrong or If i'm missing something
There was a problem hiding this comment.
Why ? Data have been extracted in the previous run 14:20 (January 2nd) - 14:20 (January 5th).
No, in my example, the first run happens at 14:21 on January 5th. But it seems like my example was not the best. I'll try to explain it differently:
As I wrote, window_minutes defines the basic size of the time window. We need this parameter in case a job starts late. Let's consider window_minutes = 10. Our 14:30 job starts at 14:31 for whatever reason. Because of window_minutes, we extract the correct time interval (14:20-14:30). If we would only have a simple lookback parameter, we would extract a wrong interval (14:21-14:31), which may lead to duplicates and missed data.
The additional_lookback extends the start time backwards. We need this parameter in case we want to deviate from our normal extraction interval. This happens on the first run, when we extract the last 3 days, because of this method in attacks.py:
def minutes_back_to_lookup(self):
# overwrite base
if self.minutes_back:
minutes = self.minutes_back
elif self.first_time_run:
minutes = 60 * 24 * 3 # 3 days
else:
minutes = 11 if LEGACY_EXTRACTION else EXTRACTION_INTERVAL
return minuteIf we would not have this deviation, we also would not need the additional_lookback parameter.
Now to back to my example: consider the first run happens at 14:21 on January 5th. If we would not have the additional_lookback parameter we would have to set window_minutes = 60 * 24 * 3. in this case we would get a wrong interval of 0:00 (on January 1st) - 0:00 (on January 3rd). But what we actually want is 14:20 (January 2nd) - 14:20 (January 5th). In my opinion we can only achieve this using both parameters. Does that make it clearer for you?
There was a problem hiding this comment.
I see two point here:
1 - why we need an additional parameter with the same beaviour (always a time shift)? Just change the value of the parameter, it's what parameter are meant to be: You have a parameter with 10 and a parameter with x - 10, just put x in one parameter, I think it's easier. Also the fact that you have to do the subtraction means they are strictly related.
2 - I don't think this solve the problem of the delay in the execution. If the cron execute at 14:31 you have to take 14:20-14:30, but this is not resolved with this solution because you will always have 3 days, all delayed of 1 minute. If we really want to solve this problem we have to save in the db when we stopped in the previous run and in this run take all the time.
Let me know if I explained clearer 😄
There was a problem hiding this comment.
Yes, I think I got it now! Thanks for your patience and your effort! :) I'll think about it and push a change tomorrow.
There was a problem hiding this comment.
No problem, glad to be helpful 😄
Don't worry about the time you don't have a deadline. It will be a problem in case a contributor proposes himself to solve an issue and then forget about it, but you are working frequently, this is not your case!
There was a problem hiding this comment.
I'm here in case you have other questions, also about the development
There was a problem hiding this comment.
I'm here in case you have other questions, also about the development
Thanks a lot! :)
I changed removed said argument from the function, please take a look. If the function would be a method of the Cronjob class, one could technically also remove the other two arguments, but that would make testing more difficult, so I decided against it.
2 - I don't think this solve the problem of the delay in the execution. If the cron execute at 14:31 you have to take 14:20-14:30, but this is not resolved with this solution because you will always have 3 days, all delayed of 1 minute. If we really want to solve this problem we have to save in the db when we stopped in the previous run and in this run take all the time.
I'm not sure if you're right here. Would it be possible for you to write a test case for the behaviour you are describing?
drosetti
left a comment
There was a problem hiding this comment.
Very good! I just noticed one thing
… and adapt tests accordingly
Description
Changes made:
namefield, primarily to speed up the existence check, when new IOC data is added in_add_ioc()QuerySet.exists()in_check_first_time_run()as mentioned in IOC extraction process slow #422ioc_recordobtained previously in_add_ioc()will now be also used for adding cowrie sessionsRelated issues
Type of change
Please delete options that are not relevant.
Checklist
develop.Black,Flake,Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.Important Rules