Integrate Random Forest models #450
Conversation
@mlodic I need to store the ML models as well as the training data persistently. I do not have any experience with how to do that with Docker. However, I found a solution. Would you please check whether my solution is sensible or not?
mlodic
left a comment
I really appreciate all the doc that you added. It's really easy to follow what you have done.
```python
"train_and_update": {
    "task": "greedybear.tasks.chain_train_and_update",
    "schedule": crontab(hour=0, minute=hp_extraction_interval // 2),
    "options": {"queue": "default"},
},
```
how much time does this task need?
I was just thinking that it could interfere with the celery worker used by the other tasks. If needed, we could create an additional custom celery worker for this use case (that would be ideal).
Right now the concurrency is set to one (see default.yml)
> how much time does this task need?

I expect it to finish in less than a minute, but I have not tested it on my "real" database. Will do that tomorrow and report the results here.
> I was just thinking that it could interfere with the celery worker used by the other tasks. If needed, we could create an additional custom celery worker for this use case (that would be ideal).
> Right now the concurrency is set to one (see default.yml)
I was also thinking about that and I played around with chaining tasks etc., but I think it is fine that way. It is important that the training takes place after the scheduled midnight extraction, but before the next extraction. So a single worker that postpones the next extraction until training is finished is actually advantageous.
ah ok, so can you please tell me why this is important and, if possible, explain it briefly as a comment? I am just worried that in the future someone (me :P) could change it without understanding the consequences.
```python
# Important:
# The training task must run with a small offset after midnight (00:00)
# to ensure training data aligns with complete calendar days.
# The small offset makes sure the midnight extraction task is completed before training.
# This way models learn from complete rather than partial day patterns,
# which is crucial for their performance.
```

Is that easy to understand?
```text
2025-02-14 13:58:01,375 - greedybear.cronjobs.scoring.scorer.RFClassifier - __init__ - DEBUG - initialize Random Forest Classifier - trainable: True
2025-02-14 13:58:01,376 - greedybear.cronjobs.scoring.scorer.RFRegressor - __init__ - DEBUG - initialize Random Forest Regressor - trainable: True
2025-02-14 13:58:01,376 - greedybear.cronjobs.base.TrainModels - execute - INFO - Starting execution
2025-02-14 13:58:01,376 - greedybear.cronjobs.base.TrainModels - run - INFO - fetching current IoC data from DB
2025-02-14 13:58:03,093 - greedybear.cronjobs.base.TrainModels - run - INFO - current IoC data is from 2025-02-14, contains 52969 IoCs
2025-02-14 13:58:03,093 - greedybear.cronjobs.base.TrainModels - load_training_data - INFO - loading training data from file system
2025-02-14 13:58:03,126 - greedybear.cronjobs.base.TrainModels - run - INFO - training data is from 2025-02-11, contains 3344 IoCs
2025-02-14 13:58:03,134 - greedybear.cronjobs.base.TrainModels - run - INFO - extracting features from training data
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - found highly correlated features
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - days_seen_count & active_timespan: 0.74
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - days_seen_count & days_since_first_seen: 0.74
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & active_days_ratio: -0.94
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & std_days_between: 0.77
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & days_since_first_seen: 1.00
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_days_ratio & std_days_between: -0.79
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_days_ratio & days_since_first_seen: -0.94
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - std_days_between & days_since_first_seen: 0.77
2025-02-14 13:58:03,265 - greedybear.cronjobs.scoring.scorer.RFClassifier - train - INFO - start training Random Forest Classifier
2025-02-14 13:58:03,962 - greedybear.cronjobs.scoring.scorer.RFClassifier - train - INFO - finished training Random Forest Classifier - recall AUC: 0.2157
2025-02-14 13:58:03,962 - greedybear.cronjobs.scoring.scorer.RFClassifier - save - INFO - saving Random Forest Classifier model to file system
2025-02-14 13:58:04,038 - greedybear.cronjobs.scoring.scorer.RFRegressor - train - INFO - start training Random Forest Regressor
2025-02-14 13:58:04,278 - greedybear.cronjobs.scoring.scorer.RFRegressor - train - INFO - finished training Random Forest Regressor - recall AUC: 0.8083
2025-02-14 13:58:04,278 - greedybear.cronjobs.scoring.scorer.RFRegressor - save - INFO - saving Random Forest Regressor model to file system
2025-02-14 13:58:04,297 - greedybear.cronjobs.base.TrainModels - save_training_data - INFO - saving current data for future training
2025-02-14 13:58:05,161 - greedybear.cronjobs.base.TrainModels - execute - INFO - Finished execution
2025-02-14 13:58:12,040 - greedybear.cronjobs.scoring.scorer.RFClassifier - __init__ - DEBUG - initialize Random Forest Classifier - trainable: True
2025-02-14 13:58:12,040 - greedybear.cronjobs.scoring.scorer.RFRegressor - __init__ - DEBUG - initialize Random Forest Regressor - trainable: True
2025-02-14 13:58:12,040 - greedybear.cronjobs.base.UpdateScores - execute - INFO - Starting execution
2025-02-14 13:58:12,047 - greedybear.cronjobs.base.UpdateScores - run - INFO - extracting features
2025-02-14 13:58:14,023 - greedybear.cronjobs.scoring.scorer.RFClassifier - score - INFO - calculate recurrence_probability with Random Forest Classifier
2025-02-14 13:58:14,476 - greedybear.cronjobs.scoring.scorer.RFClassifier - model - INFO - loading Random Forest Classifier model from file system
2025-02-14 13:58:15,205 - greedybear.cronjobs.scoring.scorer.RFRegressor - score - INFO - calculate expected_interactions with Random Forest Regressor
2025-02-14 13:58:15,663 - greedybear.cronjobs.scoring.scorer.RFRegressor - model - INFO - loading Random Forest Regressor model from file system
2025-02-14 13:58:15,912 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - begin updating scores
2025-02-14 13:58:16,844 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - checking 89042 IoCs
2025-02-14 13:58:18,832 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - writing updated scores for 52976 IoCs to DB
2025-02-14 13:58:38,592 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - 52976 IoCs were updated
2025-02-14 13:58:38,620 - greedybear.cronjobs.base.UpdateScores - execute - INFO - Finished execution
```
I have made a few optimisations to the update_db() method. As you can see, the whole training and updating run took ~40s on this database containing ~90k IoCs. It was performed on a seven-year-old notebook, so I think it will be fine even on larger databases.
```text
joblib
pandas
scikit-learn
numpy
```
if you can, please pin specific versions to allow better dependency management and reproducible builds
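For illustration, pinned requirements could look like the following; the version numbers below are placeholders, not the versions actually tested with this PR:

```text
joblib==1.3.2
pandas==2.2.0
scikit-learn==1.4.0
numpy==1.26.4
```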
```python
from greedybear.cronjobs.scoring.utils import correlated_features, get_current_data, get_features, update_iocs
from greedybear.settings import ML_MODEL_DIRECTORY


MODELS = [RFClassifier(), RFRegressor()]
```
is it intended that these classifiers are instantiated once at the start of the application and every time this module is loaded, right?
I did not intend to execute them at the start of the application, but I do not think that it's a problem. The instantiation itself is not expensive.
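On the instantiation-at-import question: Python caches imported modules in `sys.modules`, so a module body (and therefore a module-level list like `MODELS`) runs only once per process no matter how many times the module is imported. A minimal, self-contained demonstration with stand-in names:

```python
import sys
import types

# Build a fake "scorer" module whose body instantiates a model at import
# time, mirroring MODELS = [RFClassifier(), RFRegressor()].
source = (
    "counter = {'init_calls': 0}\n"
    "class RFClassifierStub:\n"
    "    def __init__(self):\n"
    "        counter['init_calls'] += 1\n"
    "MODELS = [RFClassifierStub()]\n"
)
module = types.ModuleType("scorer_demo")
exec(source, module.__dict__)   # the module body runs exactly once here
sys.modules["scorer_demo"] = module

import scorer_demo              # served from the sys.modules cache
import scorer_demo as again     # still no re-execution of the body
```

So repeated imports cost nothing extra; only spawning new worker processes would re-run the instantiation.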
```python
self.log.info("no data handed over from previous task")
self.log.info("fetching current IoC data from DB")
```
greedybear/cronjobs/scoring/utils.py (outdated)
```python
scores_by_ip = df.set_index("value")[score_names].to_dict("index")
iocs = IOC.objects.filter(Q(cowrie=True) | Q(log4j=True) | Q(general_honeypot__active=True)).filter(scanner=True).distinct()
iocs_to_update = []

for ioc in iocs:
    updated = False

    # Update scores if IP exists in new data
    if ioc.name in scores_by_ip:
        for score in score_names:
            setattr(ioc, score, scores_by_ip[ioc.name][score])
        updated = True
    # Reset old scores to 0
    else:
        for score in score_names:
            if getattr(ioc, score) > 0:
                setattr(ioc, score, 0)
                updated = True

    if updated:
        iocs_to_update.append(ioc)

return IOC.objects.bulk_update(iocs_to_update, score_names) if iocs_to_update else 0
```
I think we should not place logs inside the loop. There may be >100k IoCs and writing one line of log for each of them is too much in my opinion. Do you agree?
But I will make this function a method of the UpdateScores class and add logs outside the loop.
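The agreed pattern can be sketched as follows, with hypothetical names and plain dicts standing in for the IOC model: keep the hot loop silent and emit one summary log line after it.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("UpdateScores")

def update_scores(iocs: list, new_scores: dict) -> int:
    """Update scores in place; log a summary instead of one line per IoC."""
    updated = []
    for ioc in iocs:                       # no logging inside the hot loop
        if ioc["name"] in new_scores:
            ioc["score"] = new_scores[ioc["name"]]
            updated.append(ioc)
        elif ioc["score"] > 0:             # reset stale scores to 0
            ioc["score"] = 0
            updated.append(ioc)
    # one summary line instead of >100k per-IoC lines
    log.info("checked %d IoCs, updating %d", len(iocs), len(updated))
    return len(updated)

n = update_scores(
    [{"name": "1.2.3.4", "score": 5}, {"name": "5.6.7.8", "score": 0}],
    {"1.2.3.4": 7},
)
```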
```python
if self.model is None:
    self.load()
```
what you are doing here and in the load method can be replaced by leveraging Python's "property" decorator. That way, you can access self.model without bothering about whether it is populated or not, and it would retrieve the data from the method. Plus, by leveraging the "cached_property" decorator, that would happen only once
then, that data would stay cached in memory, so I think that you can completely remove these 2 lines and allow the code that uses the model later (like return self.model.predict_proba(X)[:, 1]) to automatically load it through the cached_property when it needs it
I'll take a look, thanks for the review! :)
I have tried to implement this, but the problem is that the model variable can be populated in two ways: by (1) loading the model from disk or by (2) training. Case 2 is the default; case 1 is relevant if training fails for some reason or if scoring is done without prior training. I think this makes using these decorators clumsy, am I right?
> Plus, you don't have to use the self.model attribute anymore for this case so it doesn't conflict with the cached property I have mentioned earlier.

Just read that again and yes, this may be the solution. :D I'll try to do it that way.
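A minimal, self-contained sketch of the cached_property idea under discussion; `DummyModel` and `Scorer` are stand-ins, not the actual GreedyBear classes. First access to `model` triggers the (here faked) load, and every later access reuses the cached value:

```python
from functools import cached_property

class DummyModel:
    """Stand-in for a scikit-learn model that would be loaded from disk."""
    def predict_proba(self, X):
        return [[0.3, 0.7] for _ in X]

class Scorer:
    load_count = 0  # counts how often the "load" actually happens

    @cached_property
    def model(self):
        # the real code would deserialize the model here (e.g. via joblib)
        Scorer.load_count += 1
        return DummyModel()

    def score(self, X):
        # first access to self.model triggers the load; later calls hit the cache
        return [probs[1] for probs in self.model.predict_proba(X)]

scorer = Scorer()
probs_a = scorer.score([[1], [2]])  # loads the model once
probs_b = scorer.score([[3]])       # reuses the cached model
```

Since training can simply assign to the instance attribute, the cached loader only fires when scoring happens without prior training, which matches the two cases described above.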
mlodic
left a comment
> @mlodic I need to store the ML models as well as the training data persistently. I do not have any experience with how to do that with Docker. However, I found a solution. Would you please check whether my solution is sensible or not?
About this, I saw you used FileSystemStorage. For me, that's ok. They are just simple files, right? I see that you are updating them every day and then you leverage ML methods on them. I don't see any problem with that.
Would it make sense to store those models over time and compare them?
I suggested some refactors; tell me if I was not clear in my suggestions.
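The persistence approach discussed above can be sketched as follows. The PR uses FileSystemStorage and joblib; this self-contained sketch uses pickle and an assumed directory path to show the same pattern of writing models to a directory that would be mounted as a Docker volume, so they survive container restarts:

```python
import pickle
from pathlib import Path

# illustrative path; in production this would be a Docker volume mount
MODEL_DIR = Path("/tmp/greedybear_models_demo")

def save_model(model, name: str) -> Path:
    """Serialize a trained model to the persistent model directory."""
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    path = MODEL_DIR / f"{name}.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path

def load_model(name: str):
    """Deserialize a previously saved model."""
    with (MODEL_DIR / f"{name}.pkl").open("rb") as f:
        return pickle.load(f)

# demo with a plain dict standing in for a trained model
saved = save_model({"kind": "rf", "version": 1}, "demo")
restored = load_model("demo")
```

Keeping dated file names instead of overwriting would also enable the "store models over time and compare them" idea.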
```python
X = df[self.features].copy()
y = df["interactions_on_eval_day"]
```
this is slightly different from the Classifier one, is it wanted?
Yes. The classifier estimates if an IoC will be seen again, the regressor estimates how often. Hence the difference.
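The difference between the two targets can be illustrated without pandas; the sample values below are made up:

```python
# interactions observed for four IoCs on the evaluation day
interactions_on_eval_day = [0, 3, 1, 0]

# regressor target: how many interactions to expect
y_reg = interactions_on_eval_day

# classifier target: will the IoC be seen again at all?
y_clf = [n > 0 for n in interactions_on_eval_day]
```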
```python
def train(self, df: pd.DataFrame) -> None:
    """
    Preprocesses features, splits data into train/test sets, and
    trains a Random Forest with optimized hyperparameters.
    Logs model performance using recall AUC score.

    Args:
        df: Training data containing features and
            'interactions_on_eval_day' target

    Raises:
        ValueError: If required features or target are missing
    """
    self.log.info(f"start training {self.name}")

    if "interactions_on_eval_day" not in df.columns:
        raise ValueError("Missing target column 'interactions_on_eval_day'")

    X = df[self.features].copy()
    y = df["interactions_on_eval_day"] > 0

    for feature in MULTI_VAL_FEATURES:
        X = multi_label_encode(X, feature)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

    params = {
        "class_weight": {False: 1, True: 4},
        "criterion": "entropy",
        "max_depth": 10,
        "max_features": "log2",
        "min_samples_leaf": 6,
        "min_samples_split": 3,
        "n_estimators": 241,
    }
    self.model = RandomForestClassifier(**params)
    self.model.fit(X_train, y_train)
    self.log.info(f"finished training {self.name} - recall AUC: {self.recall_auc(X_test, y_test):.4f}")
```
I don't see many differences between this train and the second one; would it make sense, in your opinion, to have just a single train and pass the required params to it?
That way, you can put the train and the save in the same place in the main class. I saw them later used like that:

```python
for model in MODELS:
    model.train(training_df)
    model.save()
```

IMHO it makes sense to maintain this correlated logic together in a method inside the main class and then, in the cronjob, only call a single method.
Plus, you don't have to use the self.model attribute anymore for this case so it doesn't conflict with the cached property I have mentioned earlier.
I'll think about that.
Done. I did not move train to MLModel but created a more specialized abstract class instead, RFModel, because the current training flow does not apply to non-RandomForest models.
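A hedged sketch of the shape that refactor could take: an abstract RFModel base class owning the shared training flow, with subclasses supplying only the estimator and the target. All names and the toy "estimator" are illustrative, not the actual implementation:

```python
from abc import ABC, abstractmethod

class RFModel(ABC):
    """Abstract base holding the training flow shared by the RF models."""
    name = "abstract RF model"

    @abstractmethod
    def make_estimator(self):
        """Return the underlying estimator with model-specific params."""

    @abstractmethod
    def target(self, row: dict):
        """Extract this model's training target from a data row."""

    def train(self, rows: list) -> str:
        # shared flow: derive targets, build the estimator, fit, report
        y = [self.target(row) for row in rows]
        self.model = self.make_estimator()
        return f"trained {self.name} on {len(y)} rows"

class Classifier(RFModel):
    name = "RF Classifier"
    def make_estimator(self):
        return object()  # stand-in for RandomForestClassifier(**params)
    def target(self, row):
        return row["interactions"] > 0  # will the IoC be seen again?

class Regressor(RFModel):
    name = "RF Regressor"
    def make_estimator(self):
        return object()  # stand-in for RandomForestRegressor(**params)
    def target(self, row):
        return row["interactions"]  # how many interactions to expect

msg = Classifier().train([{"interactions": 3}, {"interactions": 0}])
```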
I don't think so.

I'll write some tests later today and then this PR will be ready from my side.

super cool to have ML stuff here! :) looking forward to trying it with honeynet data!

I really hope the ML stuff performs well on the honeynet instance! :D Let me know if it doesn't.

Will you perform the merge? I have another model for which I'll make a PR afterwards.
Description
Integrate Random Forest models
Related issues
Type of change
Checklist
- The pull request targets develop.
- Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.