
Integrate Random Forest models#450

Merged
mlodic merged 19 commits into intelowlproject:develop from regulartim:rf_models
Feb 17, 2025

Conversation

@regulartim
Collaborator

@regulartim regulartim commented Feb 13, 2025

Description

Integrate Random Forest models

Related issues

Type of change

  • New feature (non-breaking change which adds functionality).

Checklist

  • I have read and understood the rules about how to Contribute to this project.
  • The pull request is for the branch develop.
  • I have added documentation of the new features.
  • Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
  • If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
  • If the GUI has been modified:
    • I have provided a screenshot of the result in the PR.
    • I have created new frontend tests for the new component or updated existing ones.

Important Rules

  • If you do not fill out the Checklist properly, your PR won't be reviewed by the maintainers.
  • If your changes decrease the overall test coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem.
  • Every time you make changes to the PR and you think the work is done, you should explicitly ask for a review. After being reviewed and receiving a "change request", you should explicitly ask for a review again once you have made the requested changes.

@regulartim
Collaborator Author

@mlodic I need to store the ML models as well as the training data persistently. I have no experience with doing that in Docker, but I found a solution. Would you please check whether my solution is sensible?

Member

@mlodic mlodic left a comment

I really appreciate all the doc that you added. It's really easy to follow what you have done.

Comment on lines +101 to +105
"train_and_update": {
"task": "greedybear.tasks.chain_train_and_update",
"schedule": crontab(hour=0, minute=hp_extraction_interval // 2),
"options": {"queue": "default"},
},
Member

how much time does this task need?

Member

I was just thinking about the fact that it could interfere with the celery worker used by the other tasks. If needed, we can create an additional custom celery worker for this use case (that would be ideal).

Right now the concurrency is set to one (see default.yml)

Collaborator Author

how much time does this task need?

I expect it to finish in less than a minute, but I have not tested it on my "real" database. Will do that tomorrow and report the results here.

Collaborator Author

I was just thinking about the fact that it could interfere with the celery worker used by the other tasks. If needed, we can create an additional custom celery worker for this use case (that would be ideal).

Right now the concurrency is set to one (see default.yml)

I was also thinking about that, and I played around with chaining tasks etc., but I think it is fine that way. It is important that the training takes place after the scheduled midnight extraction but before the next extraction. So a single worker that postpones the next extraction if the training is not finished is actually advantageous.
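For context, the ordering guarantee discussed here can be sketched as a beat-schedule fragment. The `train_and_update` entry is adapted from the snippet quoted above; the `extract_all` task name and the value of `hp_extraction_interval` are illustrative assumptions, not the actual project code.

```python
# Sketch of a CELERY_BEAT_SCHEDULE fragment (illustrative names and values).
# With worker concurrency set to 1 on the "default" queue, the midnight
# extraction and the training task are serialized, so training can never
# overlap a running extraction.
from celery.schedules import crontab

hp_extraction_interval = 10  # minutes; assumed value for illustration

CELERY_BEAT_SCHEDULE = {
    "extract_all": {  # hypothetical name for the midnight extraction task
        "task": "greedybear.tasks.extract_all",
        "schedule": crontab(hour=0, minute=0),
        "options": {"queue": "default"},
    },
    "train_and_update": {
        "task": "greedybear.tasks.chain_train_and_update",
        # small offset after midnight: runs after the extraction has been queued
        "schedule": crontab(hour=0, minute=hp_extraction_interval // 2),
        "options": {"queue": "default"},
    },
}
```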

Member

ah ok, so can you please tell me why this is important and, if possible, explain it briefly in a comment? I am just worried that in the future someone (me :P) could change it without understanding the consequences.

Collaborator Author

@regulartim regulartim Feb 13, 2025

    # Important:
    # The training task must run with a small offset after midnight (00:00)
    # to ensure training data aligns with complete calendar days.
    # The small offset makes sure that the midnight extraction task is completed before training.
    # This way, models learn from complete rather than partial day patterns, which is crucial for their performance.

Is that easy to understand?

Collaborator Author

Done.

Collaborator Author

2025-02-14 13:58:01,375 - greedybear.cronjobs.scoring.scorer.RFClassifier - __init__ - DEBUG - initialize Random Forest Classifier - trainable: True
2025-02-14 13:58:01,376 - greedybear.cronjobs.scoring.scorer.RFRegressor - __init__ - DEBUG - initialize Random Forest Regressor - trainable: True
2025-02-14 13:58:01,376 - greedybear.cronjobs.base.TrainModels - execute - INFO - Starting execution
2025-02-14 13:58:01,376 - greedybear.cronjobs.base.TrainModels - run - INFO - fetching current IoC data from DB
2025-02-14 13:58:03,093 - greedybear.cronjobs.base.TrainModels - run - INFO - current IoC data is from 2025-02-14, contains 52969 IoCs
2025-02-14 13:58:03,093 - greedybear.cronjobs.base.TrainModels - load_training_data - INFO - loading training data from file system
2025-02-14 13:58:03,126 - greedybear.cronjobs.base.TrainModels - run - INFO - training data is from 2025-02-11, contains 3344 IoCs
2025-02-14 13:58:03,134 - greedybear.cronjobs.base.TrainModels - run - INFO - extracting features from training data
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - found highly correlated features
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - days_seen_count & active_timespan: 0.74
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - days_seen_count & days_since_first_seen: 0.74
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & active_days_ratio: -0.94
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & std_days_between: 0.77
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & days_since_first_seen: 1.00
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_days_ratio & std_days_between: -0.79
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_days_ratio & days_since_first_seen: -0.94
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - std_days_between & days_since_first_seen: 0.77
2025-02-14 13:58:03,265 - greedybear.cronjobs.scoring.scorer.RFClassifier - train - INFO - start training Random Forest Classifier
2025-02-14 13:58:03,962 - greedybear.cronjobs.scoring.scorer.RFClassifier - train - INFO - finished training Random Forest Classifier - recall AUC: 0.2157
2025-02-14 13:58:03,962 - greedybear.cronjobs.scoring.scorer.RFClassifier - save - INFO - saving Random Forest Classifier model to file system
2025-02-14 13:58:04,038 - greedybear.cronjobs.scoring.scorer.RFRegressor - train - INFO - start training Random Forest Regressor
2025-02-14 13:58:04,278 - greedybear.cronjobs.scoring.scorer.RFRegressor - train - INFO - finished training Random Forest Regressor - recall AUC: 0.8083
2025-02-14 13:58:04,278 - greedybear.cronjobs.scoring.scorer.RFRegressor - save - INFO - saving Random Forest Regressor model to file system
2025-02-14 13:58:04,297 - greedybear.cronjobs.base.TrainModels - save_training_data - INFO - saving current data for future training
2025-02-14 13:58:05,161 - greedybear.cronjobs.base.TrainModels - execute - INFO - Finished execution
2025-02-14 13:58:12,040 - greedybear.cronjobs.scoring.scorer.RFClassifier - __init__ - DEBUG - initialize Random Forest Classifier - trainable: True
2025-02-14 13:58:12,040 - greedybear.cronjobs.scoring.scorer.RFRegressor - __init__ - DEBUG - initialize Random Forest Regressor - trainable: True
2025-02-14 13:58:12,040 - greedybear.cronjobs.base.UpdateScores - execute - INFO - Starting execution
2025-02-14 13:58:12,047 - greedybear.cronjobs.base.UpdateScores - run - INFO - extracting features
2025-02-14 13:58:14,023 - greedybear.cronjobs.scoring.scorer.RFClassifier - score - INFO - calculate recurrence_probability with Random Forest Classifier
2025-02-14 13:58:14,476 - greedybear.cronjobs.scoring.scorer.RFClassifier - model - INFO - loading Random Forest Classifier model from file system
2025-02-14 13:58:15,205 - greedybear.cronjobs.scoring.scorer.RFRegressor - score - INFO - calculate expected_interactions with Random Forest Regressor
2025-02-14 13:58:15,663 - greedybear.cronjobs.scoring.scorer.RFRegressor - model - INFO - loading Random Forest Regressor model from file system
2025-02-14 13:58:15,912 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - begin updating scores
2025-02-14 13:58:16,844 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - checking 89042 IoCs
2025-02-14 13:58:18,832 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - writing updated scores for 52976 IoCs to DB
2025-02-14 13:58:38,592 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - 52976 IoCs were updated
2025-02-14 13:58:38,620 - greedybear.cronjobs.base.UpdateScores - execute - INFO - Finished execution

I have made a few optimisations to the update_db() method. As you can see, the whole training and updating run took ~40s on this database containing ~90k IoCs, and it was performed on a seven-year-old notebook. So I think it will be fine, even on larger databases.

Member

cool thank you for the insights

Comment on lines +19 to +22
joblib
pandas
scikit-learn
numpy
Member

if you can, please add specific versions to allow better management and reproducible builds
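Pinned entries for the four packages above might look roughly like this; the version numbers are illustrative only and should come from the actually tested environment (e.g. via pip freeze), not from this sketch.

```
joblib~=1.4
pandas~=2.2
scikit-learn~=1.6
numpy~=2.2
```

The `~=` compatible-release operator allows patch updates while blocking potentially breaking minor/major bumps; exact `==` pins are the stricter alternative for fully reproducible builds.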

Collaborator Author

Done.

from greedybear.cronjobs.scoring.utils import correlated_features, get_current_data, get_features, update_iocs
from greedybear.settings import ML_MODEL_DIRECTORY

MODELS = [RFClassifier(), RFRegressor()]
Member

is it intended that these classifiers are instantiated at the start of the application and every time this module is loaded?

Collaborator Author

I did not intend to execute them at the start of the application, but I do not think that it's a problem. The instantiation itself is not expensive.

Comment on lines +153 to +154
self.log.info("no data handed over from previous task")
self.log.info("fetching current IoC data from DB")
Member

can be merged

Collaborator Author

Done.

Comment on lines +186 to +208
scores_by_ip = df.set_index("value")[score_names].to_dict("index")
iocs = IOC.objects.filter(Q(cowrie=True) | Q(log4j=True) | Q(general_honeypot__active=True)).filter(scanner=True).distinct()
iocs_to_update = []

for ioc in iocs:
updated = False

# Update scores if IP exists in new data
if ioc.name in scores_by_ip:
for score in score_names:
setattr(ioc, score, scores_by_ip[ioc.name][score])
updated = True
# Reset old scores to 0
else:
for score in score_names:
if getattr(ioc, score) > 0:
setattr(ioc, score, 0)
updated = True

if updated:
iocs_to_update.append(ioc)

return IOC.objects.bulk_update(iocs_to_update, score_names) if iocs_to_update else 0
Member

can you add a few logs here?

Collaborator Author

I think we should not place logs inside the loop. There may be >100k IoCs, and writing one line of log for each of them is too much in my opinion. Do you agree?

Collaborator Author

But I will make this function a method of the UpdateScores class and add logs outside the loop.
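A minimal sketch of that approach (function and field names are hypothetical simplifications of the diff above): count inside the loop, log only aggregates before and after it.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("update_scores")

def update_scores(iocs: list[dict], scores_by_ip: dict[str, float]) -> tuple[int, int]:
    """Update IoC scores in place; emit summary logs instead of one line per IoC."""
    updated, reset = 0, 0
    log.info("checking %d IoCs", len(iocs))
    for ioc in iocs:
        if ioc["name"] in scores_by_ip:
            # IP present in the new data: take over the fresh score
            ioc["score"] = scores_by_ip[ioc["name"]]
            updated += 1
        elif ioc["score"] > 0:
            # IP absent from the new data: reset a stale non-zero score
            ioc["score"] = 0
            reset += 1
    # summary logs outside the loop: O(1) log lines regardless of IoC count
    log.info("updated scores for %d IoCs, reset %d stale scores", updated, reset)
    return updated, reset
```

This keeps log volume constant even with hundreds of thousands of IoCs.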

Collaborator Author

Done.

Comment on lines +101 to +102
if self.model is None:
self.load()
Member

@mlodic mlodic Feb 13, 2025

what you are doing here and in the load method can be replaced by leveraging Python's "property" decorator. That way, you can access self.model without worrying about whether it is populated or not; the data would be retrieved from the method. Plus, by leveraging the "cached_property" decorator, that would happen only once

Member

then that data would stay cached in memory, so I think you can completely remove these 2 lines and allow the code that uses the model later (like return self.model.predict_proba(X)[:, 1]) to load it automatically via the cached_property when it needs it
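The suggested pattern can be sketched with a dummy stand-in for the real model (no sklearn or joblib here; all names are illustrative): cached_property loads lazily on first access, and because it is a non-data descriptor, train() can simply assign to self.model, which then shadows the loader entirely.

```python
from functools import cached_property

class Scorer:
    """Sketch of lazy model loading via cached_property (names hypothetical)."""
    load_calls = 0  # class-level counter, just to demonstrate single loading

    @cached_property
    def model(self):
        # runs at most once per instance, on first access,
        # and never if train() assigned a model first
        Scorer.load_calls += 1
        return {"source": "disk"}  # stand-in for joblib.load(path)

    def train(self):
        # plain assignment writes to the instance __dict__,
        # shadowing the cached_property loader
        self.model = {"source": "training"}
```

Code like `self.model.predict_proba(X)` then works whether or not training ran first: untrained instances transparently fall back to loading from disk.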

Collaborator Author

I'll take a look, thanks for the review! :)

Collaborator Author

I have tried to implement this, but the problem is that the model variable can be populated in two ways: by (1) loading the model from disk or by (2) training. Case 2 is the default; case 1 is relevant if training fails for some reason or if scoring is done without prior training. I think this makes using these decorators clumsy, am I right?

Collaborator Author

Plus, you don't have to use the self.model attribute anymore in this case, so it doesn't conflict with the cached property I mentioned earlier.

Just read that again and yes, this may be the solution. :D I'll try to do it that way.

Collaborator Author

Done.

Member

@mlodic mlodic left a comment

@mlodic I need to store the ML models as well as the training data persistently. I have no experience with doing that in Docker, but I found a solution. Would you please check whether my solution is sensible?

About this, I saw you used FileSystemStorage. For me, that's ok. They are just simple files, right? I see that you are updating them every day and then leveraging ML methods on them. I don't see any problem with that.
Would it make sense to store those models over time and compare them?

I suggested some refactors, tell me whether I was not clear in my suggestions.
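The persistence approach under discussion (the PR uses joblib through Django's FileSystemStorage onto a Docker volume) boils down to a save/load round trip. A stdlib-only sketch of that round trip, with illustrative paths and names, might look like this:

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative stand-in for ML_MODEL_DIRECTORY; in the PR this points at a
# Docker volume so models and training data survive container restarts.
MODEL_DIR = Path(tempfile.mkdtemp())

def save_model(model, name: str) -> Path:
    """Serialize a model to the persistent model directory."""
    path = MODEL_DIR / f"{name}.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path

def load_model(name: str):
    """Load a previously saved model, e.g. when scoring runs without prior training."""
    with (MODEL_DIR / f"{name}.pkl").open("rb") as f:
        return pickle.load(f)

# stand-in dict instead of a fitted estimator, purely for illustration
save_model({"n_estimators": 241}, "rf_classifier")
restored = load_model("rf_classifier")
```

In a Compose setup, mounting a named volume at the model directory is what makes these files persistent across container rebuilds.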

Comment on lines +95 to +96
X = df[self.features].copy()
y = df["interactions_on_eval_day"]
Member

this is slightly different from the Classifier one; is it wanted?

Collaborator Author

Yes. The classifier estimates if an IoC will be seen again, the regressor estimates how often. Hence the difference.
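In other words (a toy sketch with plain lists, assuming interactions_on_eval_day holds per-IoC interaction counts as in the diff): both targets derive from the same column, but the classifier's target is the boolean count > 0 while the regressor's target is the raw count.

```python
# Toy illustration of the two targets derived from the same column.
interactions_on_eval_day = [0, 3, 0, 7, 1]

# Classifier target: will this IoC be seen again at all?
y_classifier = [count > 0 for count in interactions_on_eval_day]

# Regressor target: how many interactions do we expect?
y_regressor = interactions_on_eval_day
```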

Comment on lines +22 to +61
def train(self, df: pd.DataFrame) -> None:
"""
Preprocesses features, splits data into train/test sets, and
trains a Random Forest with optimized hyperparameters.
Logs model performance using recall AUC score.

Args:
df: Training data containing features and
'interactions_on_eval_day' target

Raises:
ValueError: If required features or target are missing
"""
self.log.info(f"start training {self.name}")

if "interactions_on_eval_day" not in df.columns:
raise ValueError("Missing target column 'interactions_on_eval_day'")

X = df[self.features].copy()
y = df["interactions_on_eval_day"] > 0

for feature in MULTI_VAL_FEATURES:
X = multi_label_encode(X, feature)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

params = {
"class_weight": {False: 1, True: 4},
"criterion": "entropy",
"max_depth": 10,
"max_features": "log2",
"min_samples_leaf": 6,
"min_samples_split": 3,
"n_estimators": 241,
}
self.model = RandomForestClassifier(
**params,
)
self.model.fit(X_train, y_train)
self.log.info(f"finished training {self.name} - recall AUC: {self.recall_auc(X_test, y_test):.4f}")
Member

I don't see many differences between this train and the second one; would it make sense, in your opinion, to have just a single train and pass the required params to it?

In that way, you can put the train and the save in the same place in the main class. I saw them used later like this:

        for model in MODELS:
            model.train(training_df)
            model.save()

IMHO it makes sense to maintain this correlated logic together in a method inside the main class and then, in the cronjob, only call a single method.

Plus, you don't have to use the self.model attribute anymore in this case, so it doesn't conflict with the cached property I mentioned earlier.

Collaborator Author

I'll think about that.

Collaborator Author

Done. I did not move train to MLModel but created a more specialized abstract class instead, RFModel, because the current training flow does not apply to non-RandomForest models.
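The shape of that refactor can be sketched dependency-free (the estimator is a stub standing in for sklearn's RandomForest* classes; all names are illustrative): the abstract base owns the shared train flow, and each subclass supplies only its hyperparameters and its target transform.

```python
from abc import ABC, abstractmethod

class StubEstimator:
    """Stand-in for sklearn's RandomForestClassifier/Regressor."""
    def __init__(self, **params):
        self.params = params
        self.fitted = False
    def fit(self, X, y):
        self.fitted = True

class RFModel(ABC):
    """Shared Random Forest training flow; subclasses only differ in config."""
    saved = False

    @property
    @abstractmethod
    def params(self) -> dict:
        """Hyperparameters for the underlying estimator."""

    @abstractmethod
    def target(self, counts: list):
        """Derive the training target from interaction counts."""

    def train(self, X, counts):
        # one shared flow: build estimator from subclass params, fit on
        # the subclass-specific target
        self.model = StubEstimator(**self.params)
        self.model.fit(X, self.target(counts))

    def save(self):
        self.saved = True  # stand-in for persisting the model to disk

class RFClassifier(RFModel):
    params = {"n_estimators": 241, "criterion": "entropy"}
    def target(self, counts):
        return [c > 0 for c in counts]  # seen again: yes/no

class RFRegressor(RFModel):
    params = {"n_estimators": 100}
    def target(self, counts):
        return counts  # expected number of interactions
```

The cronjob then only loops over instances, calling model.train(...) and model.save(), as in the snippet quoted earlier in the review.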

@regulartim
Collaborator Author

Would it make sense to store those models over time and compare them?

I don't think so.

@regulartim
Collaborator Author

I'll write some tests later today and then this PR will be ready from my side.

@regulartim regulartim marked this pull request as ready for review February 14, 2025 16:47
@regulartim regulartim requested a review from mlodic February 14, 2025 16:47
@mlodic
Member

mlodic commented Feb 17, 2025

super cool to have ML stuff here! :) looking forward to trying it with honeynet data!

@regulartim
Collaborator Author

I really hope the ML stuff performs well on the honeynet instance! :D Let me know if it doesn't.

@regulartim
Collaborator Author

Will you perform the merge? I have another model for which I'll make a PR afterwards.

@mlodic mlodic merged commit abbd613 into intelowlproject:develop Feb 17, 2025
5 checks passed
@regulartim regulartim deleted the rf_models branch February 21, 2025 14:30
