Integrate Random Forest models #450
Conversation
@mlodic I need to store the ML models as well as the training data persistently. I do not have any experience with how to do that with Docker. However, I found a solution. Would you please check whether my solution is sensible or not?
mlodic
left a comment
I really appreciate all the doc that you added. It's really easy to follow what you have done.
```python
"train_and_update": {
    "task": "greedybear.tasks.chain_train_and_update",
    "schedule": crontab(hour=0, minute=hp_extraction_interval // 2),
    "options": {"queue": "default"},
},
```
how much time does this task need?
I was just thinking that it could interfere with the celery worker used by the other tasks. If needed, we could create an additional custom celery worker for this use case (that would be ideal).
Right now the concurrency is set to one (see default.yml)
> how much time does this task need?

I expect it to finish in less than a minute, but I have not tested it on my "real" database. Will do that tomorrow and report the results here.
> I was just thinking that it could interfere with the celery worker used by the other tasks. If needed, we could create an additional custom celery worker for this use case (that would be ideal).
> Right now the concurrency is set to one (see default.yml)
I was also thinking about that and I played around with chaining tasks etc., but I think it is fine that way. It is important that the training takes place after the scheduled midnight extraction, but before the next extraction. So a single worker that postpones the next extraction until training is finished is actually advantageous.
ah ok, so can you please tell me why this is important and, if possible, explain it briefly as a comment? I am just worried that in the future someone (me :P) could change it without understanding the consequences.
```python
# Important:
# The training task must run with a small offset after midnight (00:00)
# to ensure training data aligns with complete calendar days.
# The small offset makes sure the midnight extraction task is completed before training.
# This way models learn from complete rather than partial day patterns,
# which is crucial for their performance.
```

Is that easy to understand?
```text
2025-02-14 13:58:01,375 - greedybear.cronjobs.scoring.scorer.RFClassifier - __init__ - DEBUG - initialize Random Forest Classifier - trainable: True
2025-02-14 13:58:01,376 - greedybear.cronjobs.scoring.scorer.RFRegressor - __init__ - DEBUG - initialize Random Forest Regressor - trainable: True
2025-02-14 13:58:01,376 - greedybear.cronjobs.base.TrainModels - execute - INFO - Starting execution
2025-02-14 13:58:01,376 - greedybear.cronjobs.base.TrainModels - run - INFO - fetching current IoC data from DB
2025-02-14 13:58:03,093 - greedybear.cronjobs.base.TrainModels - run - INFO - current IoC data is from 2025-02-14, contains 52969 IoCs
2025-02-14 13:58:03,093 - greedybear.cronjobs.base.TrainModels - load_training_data - INFO - loading training data from file system
2025-02-14 13:58:03,126 - greedybear.cronjobs.base.TrainModels - run - INFO - training data is from 2025-02-11, contains 3344 IoCs
2025-02-14 13:58:03,134 - greedybear.cronjobs.base.TrainModels - run - INFO - extracting features from training data
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - found highly correlated features
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - days_seen_count & active_timespan: 0.74
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - days_seen_count & days_since_first_seen: 0.74
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & active_days_ratio: -0.94
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & std_days_between: 0.77
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_timespan & days_since_first_seen: 1.00
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_days_ratio & std_days_between: -0.79
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - active_days_ratio & days_since_first_seen: -0.94
2025-02-14 13:58:03,265 - greedybear.cronjobs.base.TrainModels - run - DEBUG - std_days_between & days_since_first_seen: 0.77
2025-02-14 13:58:03,265 - greedybear.cronjobs.scoring.scorer.RFClassifier - train - INFO - start training Random Forest Classifier
2025-02-14 13:58:03,962 - greedybear.cronjobs.scoring.scorer.RFClassifier - train - INFO - finished training Random Forest Classifier - recall AUC: 0.2157
2025-02-14 13:58:03,962 - greedybear.cronjobs.scoring.scorer.RFClassifier - save - INFO - saving Random Forest Classifier model to file system
2025-02-14 13:58:04,038 - greedybear.cronjobs.scoring.scorer.RFRegressor - train - INFO - start training Random Forest Regressor
2025-02-14 13:58:04,278 - greedybear.cronjobs.scoring.scorer.RFRegressor - train - INFO - finished training Random Forest Regressor - recall AUC: 0.8083
2025-02-14 13:58:04,278 - greedybear.cronjobs.scoring.scorer.RFRegressor - save - INFO - saving Random Forest Regressor model to file system
2025-02-14 13:58:04,297 - greedybear.cronjobs.base.TrainModels - save_training_data - INFO - saving current data for future training
2025-02-14 13:58:05,161 - greedybear.cronjobs.base.TrainModels - execute - INFO - Finished execution
2025-02-14 13:58:12,040 - greedybear.cronjobs.scoring.scorer.RFClassifier - __init__ - DEBUG - initialize Random Forest Classifier - trainable: True
2025-02-14 13:58:12,040 - greedybear.cronjobs.scoring.scorer.RFRegressor - __init__ - DEBUG - initialize Random Forest Regressor - trainable: True
2025-02-14 13:58:12,040 - greedybear.cronjobs.base.UpdateScores - execute - INFO - Starting execution
2025-02-14 13:58:12,047 - greedybear.cronjobs.base.UpdateScores - run - INFO - extracting features
2025-02-14 13:58:14,023 - greedybear.cronjobs.scoring.scorer.RFClassifier - score - INFO - calculate recurrence_probability with Random Forest Classifier
2025-02-14 13:58:14,476 - greedybear.cronjobs.scoring.scorer.RFClassifier - model - INFO - loading Random Forest Classifier model from file system
2025-02-14 13:58:15,205 - greedybear.cronjobs.scoring.scorer.RFRegressor - score - INFO - calculate expected_interactions with Random Forest Regressor
2025-02-14 13:58:15,663 - greedybear.cronjobs.scoring.scorer.RFRegressor - model - INFO - loading Random Forest Regressor model from file system
2025-02-14 13:58:15,912 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - begin updating scores
2025-02-14 13:58:16,844 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - checking 89042 IoCs
2025-02-14 13:58:18,832 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - writing updated scores for 52976 IoCs to DB
2025-02-14 13:58:38,592 - greedybear.cronjobs.base.UpdateScores - update_db - INFO - 52976 IoCs were updated
2025-02-14 13:58:38,620 - greedybear.cronjobs.base.UpdateScores - execute - INFO - Finished execution
```
I have made a few optimisations to the update_db() method. As you can see, the whole training and updating run took ~40s on this database containing ~90k IoCs. It was performed on a seven-year-old notebook, so I think it will be fine even on larger databases.
```text
joblib
pandas
scikit-learn
numpy
```
if you can, please pin specific versions to allow better dependency management and reproducible builds
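For illustration, pinned requirements could look like the following; the version numbers below are placeholders, not the versions actually tested with this PR:

```text
joblib==1.3.2
pandas==2.2.0
scikit-learn==1.4.0
numpy==1.26.4
```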
```python
from greedybear.cronjobs.scoring.utils import correlated_features, get_current_data, get_features, update_iocs
from greedybear.settings import ML_MODEL_DIRECTORY


MODELS = [RFClassifier(), RFRegressor()]
```
is it intended that these classifiers are instantiated once at the start of the application and every time this module is loaded, right?
I did not intend to execute them at the start of the application, but I do not think that it's a problem. The instantiation itself is not expensive.
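On the instantiation-at-import question: Python caches imported modules in `sys.modules`, so a module body (and therefore a module-level list like `MODELS`) runs only once per process no matter how many times the module is imported. A minimal, self-contained demonstration with stand-in names:

```python
import sys
import types

# Build a fake "scorer" module whose body instantiates a model at import
# time, mirroring MODELS = [RFClassifier(), RFRegressor()].
source = (
    "counter = {'init_calls': 0}\n"
    "class RFClassifierStub:\n"
    "    def __init__(self):\n"
    "        counter['init_calls'] += 1\n"
    "MODELS = [RFClassifierStub()]\n"
)
module = types.ModuleType("scorer_demo")
exec(source, module.__dict__)   # the module body runs exactly once here
sys.modules["scorer_demo"] = module

import scorer_demo              # served from the sys.modules cache
import scorer_demo as again     # still no re-execution of the body
```

So repeated imports cost nothing extra; only spawning new worker processes would re-run the instantiation.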
```python
self.log.info("no data handed over from previous task")
self.log.info("fetching current IoC data from DB")
```
greedybear/cronjobs/scoring/utils.py (outdated)
```python
scores_by_ip = df.set_index("value")[score_names].to_dict("index")
iocs = IOC.objects.filter(Q(cowrie=True) | Q(log4j=True) | Q(general_honeypot__active=True)).filter(scanner=True).distinct()
iocs_to_update = []

for ioc in iocs:
    updated = False

    # Update scores if IP exists in new data
    if ioc.name in scores_by_ip:
        for score in score_names:
            setattr(ioc, score, scores_by_ip[ioc.name][score])
        updated = True
    # Reset old scores to 0
    else:
        for score in score_names:
            if getattr(ioc, score) > 0:
                setattr(ioc, score, 0)
                updated = True

    if updated:
        iocs_to_update.append(ioc)

return IOC.objects.bulk_update(iocs_to_update, score_names) if iocs_to_update else 0
```
I think we should not place logs inside the loop. There may be >100k IoCs and writing one line of log for each of them is too much in my opinion. Do you agree?
But I will make this function a method of the UpdateScores class and add logs outside the loop.
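The agreed pattern can be sketched as follows, with hypothetical names and plain dicts standing in for the IOC model: keep the hot loop silent and emit one summary log line after it.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("UpdateScores")

def update_scores(iocs: list, new_scores: dict) -> int:
    """Update scores in place; log a summary instead of one line per IoC."""
    updated = []
    for ioc in iocs:                       # no logging inside the hot loop
        if ioc["name"] in new_scores:
            ioc["score"] = new_scores[ioc["name"]]
            updated.append(ioc)
        elif ioc["score"] > 0:             # reset stale scores to 0
            ioc["score"] = 0
            updated.append(ioc)
    # one summary line instead of >100k per-IoC lines
    log.info("checked %d IoCs, updating %d", len(iocs), len(updated))
    return len(updated)

n = update_scores(
    [{"name": "1.2.3.4", "score": 5}, {"name": "5.6.7.8", "score": 0}],
    {"1.2.3.4": 7},
)
```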
```python
if self.model is None:
    self.load()
```
what you are doing here and in the load method can be replaced by leveraging Python's "property" decorator. That way, you can access self.model without bothering about whether it is populated or not, and it would retrieve the data from the method. Plus, by leveraging the "cached_property" decorator, that would happen only once
then, that data would stay cached in memory, so I think that you can completely remove these 2 lines and allow the code that uses the model later (like return self.model.predict_proba(X)[:, 1]) to automatically load it through the cached_property when it needs it
I'll take a look, thanks for the review! :)
I have tried to implement this, but the problem is that the model variable can be populated in two ways: by (1) loading the model from disk or by (2) training. Case 2 is the default; case 1 is relevant if training fails for some reason or if scoring is done without prior training. I think this makes using these decorators clumsy, am I right?
> Plus, you don't have to use the self.model attribute anymore for this case so it doesn't conflict with the cached property I have mentioned earlier.

Just read that again and yes, this may be the solution. :D I'll try to do it that way.
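A minimal, self-contained sketch of the cached_property idea under discussion; `DummyModel` and `Scorer` are stand-ins, not the actual GreedyBear classes. First access to `model` triggers the (here faked) load, and every later access reuses the cached value:

```python
from functools import cached_property

class DummyModel:
    """Stand-in for a scikit-learn model that would be loaded from disk."""
    def predict_proba(self, X):
        return [[0.3, 0.7] for _ in X]

class Scorer:
    load_count = 0  # counts how often the "load" actually happens

    @cached_property
    def model(self):
        # the real code would deserialize the model here (e.g. via joblib)
        Scorer.load_count += 1
        return DummyModel()

    def score(self, X):
        # first access to self.model triggers the load; later calls hit the cache
        return [probs[1] for probs in self.model.predict_proba(X)]

scorer = Scorer()
probs_a = scorer.score([[1], [2]])  # loads the model once
probs_b = scorer.score([[3]])       # reuses the cached model
```

Since training can simply assign to the instance attribute, the cached loader only fires when scoring happens without prior training, which matches the two cases described above.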
mlodic
left a comment
> @mlodic I need to store the ML models as well as the training data persistently. I do not have any experience with how to do that with Docker. However, I found a solution. Would you please check whether my solution is sensible or not?
About this, I saw you used FileSystemStorage. For me, that's ok. They are just simple files, right? I see that you are updating them every day and then you leverage ML methods on them. I don't see any problem with that.
Would it make sense to store those models over time and compare them?
I suggested some refactors; tell me if I was not clear in my suggestions.
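The persistence approach discussed above can be sketched as follows. The PR uses FileSystemStorage and joblib; this self-contained sketch uses pickle and an assumed directory path to show the same pattern of writing models to a directory that would be mounted as a Docker volume, so they survive container restarts:

```python
import pickle
from pathlib import Path

# illustrative path; in production this would be a Docker volume mount
MODEL_DIR = Path("/tmp/greedybear_models_demo")

def save_model(model, name: str) -> Path:
    """Serialize a trained model to the persistent model directory."""
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    path = MODEL_DIR / f"{name}.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path

def load_model(name: str):
    """Deserialize a previously saved model."""
    with (MODEL_DIR / f"{name}.pkl").open("rb") as f:
        return pickle.load(f)

# demo with a plain dict standing in for a trained model
saved = save_model({"kind": "rf", "version": 1}, "demo")
restored = load_model("demo")
```

Keeping dated file names instead of overwriting would also enable the "store models over time and compare them" idea.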
```python
X = df[self.features].copy()
y = df["interactions_on_eval_day"]
```
this is slightly different from the Classifier one, is it wanted?
Yes. The classifier estimates if an IoC will be seen again, the regressor estimates how often. Hence the difference.
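The difference between the two targets can be illustrated without pandas; the sample values below are made up:

```python
# interactions observed for four IoCs on the evaluation day
interactions_on_eval_day = [0, 3, 1, 0]

# regressor target: how many interactions to expect
y_reg = interactions_on_eval_day

# classifier target: will the IoC be seen again at all?
y_clf = [n > 0 for n in interactions_on_eval_day]
```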
```python
def train(self, df: pd.DataFrame) -> None:
    """
    Preprocesses features, splits data into train/test sets, and
    trains a Random Forest with optimized hyperparameters.
    Logs model performance using recall AUC score.

    Args:
        df: Training data containing features and
            'interactions_on_eval_day' target

    Raises:
        ValueError: If required features or target are missing
    """
    self.log.info(f"start training {self.name}")

    if "interactions_on_eval_day" not in df.columns:
        raise ValueError("Missing target column 'interactions_on_eval_day'")

    X = df[self.features].copy()
    y = df["interactions_on_eval_day"] > 0

    for feature in MULTI_VAL_FEATURES:
        X = multi_label_encode(X, feature)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

    params = {
        "class_weight": {False: 1, True: 4},
        "criterion": "entropy",
        "max_depth": 10,
        "max_features": "log2",
        "min_samples_leaf": 6,
        "min_samples_split": 3,
        "n_estimators": 241,
    }
    self.model = RandomForestClassifier(**params)
    self.model.fit(X_train, y_train)
    self.log.info(f"finished training {self.name} - recall AUC: {self.recall_auc(X_test, y_test):.4f}")
```
I don't see many differences between this train and the second one; would it make sense, in your opinion, to have just a single train and pass the required params to it?
That way, you can put the train and the save in the same place in the main class. I saw them later used like that:

```python
for model in MODELS:
    model.train(training_df)
    model.save()
```

IMHO it makes sense to maintain this correlated logic together in a method inside the main class and then, in the cronjob, only call a single method.
Plus, you don't have to use the self.model attribute anymore for this case so it doesn't conflict with the cached property I have mentioned earlier.
I'll think about that.
Done. I did not move train to MLModel but created a more specialized abstract class instead, RFModel, because the current training flow does not apply to non-RandomForest models.
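A hedged sketch of the shape that refactor could take: an abstract RFModel base class owning the shared training flow, with subclasses supplying only the estimator and the target. All names and the toy "estimator" are illustrative, not the actual implementation:

```python
from abc import ABC, abstractmethod

class RFModel(ABC):
    """Abstract base holding the training flow shared by the RF models."""
    name = "abstract RF model"

    @abstractmethod
    def make_estimator(self):
        """Return the underlying estimator with model-specific params."""

    @abstractmethod
    def target(self, row: dict):
        """Extract this model's training target from a data row."""

    def train(self, rows: list) -> str:
        # shared flow: derive targets, build the estimator, fit, report
        y = [self.target(row) for row in rows]
        self.model = self.make_estimator()
        return f"trained {self.name} on {len(y)} rows"

class Classifier(RFModel):
    name = "RF Classifier"
    def make_estimator(self):
        return object()  # stand-in for RandomForestClassifier(**params)
    def target(self, row):
        return row["interactions"] > 0  # will the IoC be seen again?

class Regressor(RFModel):
    name = "RF Regressor"
    def make_estimator(self):
        return object()  # stand-in for RandomForestRegressor(**params)
    def target(self, row):
        return row["interactions"]  # how many interactions to expect

msg = Classifier().train([{"interactions": 3}, {"interactions": 0}])
```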
I don't think so.

I'll write some tests later today and then this PR will be ready from my side.

super cool to have ML stuff here! :) looking forward to trying it with honeynet data!

I really hope the ML stuff performs well on the honeynet instance! :D Let me know if it doesn't.

Will you perform the merge? I have another model for which I'll make a PR afterwards.
Description
Integrate Random Forest models
Related issues
Type of change
Checklist
- The pull request targets develop.
- Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.