feat: use dictionary for flexibly transforming and standardizing features by joyceljy · Pull Request #418 · DeepRank/deeprank2

joyceljy · 2023-04-18T16:05:07Z

Improve feature standardization by applying a suitable transformation to each feature.
For now, we modify the package by adding a dictionary that indicates which transformation each feature should receive. The details are as follows:

We won't normalize or standardize one-hot encoded features. For motivation, see for example this thread (this is the common opinion in the community).
In general, the rule of thumb for deciding whether to apply a transformation before standardization is to obtain a distribution that spreads out the feature values, ideally resembling a Gaussian, though this is not mandatory (for example, the cube-root version of electrostatic and vanderwaals is better than the original one).
Features to which we won't apply log, but standardization directly: res_size, res_charge, hb_donors, hb_acceptors, hse, irc_ features, res_mass, res_pI, distance
Features to which we'll apply log(x+1): res_depth, bsa, info_content
Features to which we'll apply square root: sasa
Features to which we'll apply cube root: electrostatic, vanderwaals
We'll remove pssm for now, since it's not correctly computed.
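
The mapping above could be sketched as a dictionary like the one below. This is only an illustration: the keys 'transform' and 'standardize' follow the names used later in this thread, while the helper `transform_and_standardize` and the exact dictionary layout are assumptions, not the package's actual code. For brevity the mean and standard deviation are taken from the array itself, whereas a real dataset would compute them over the training set.

```python
import numpy as np

# Hypothetical layout: feature name -> transformation (or None) and a standardize flag.
# One-hot encoded features are simply left out, so they are never touched.
features_transform = {
    "res_depth": {"transform": np.log1p, "standardize": True},      # log(x + 1)
    "bsa": {"transform": np.log1p, "standardize": True},            # log(x + 1)
    "info_content": {"transform": np.log1p, "standardize": True},   # log(x + 1)
    "sasa": {"transform": np.sqrt, "standardize": True},            # square root
    "electrostatic": {"transform": np.cbrt, "standardize": True},   # cube root
    "vanderwaals": {"transform": np.cbrt, "standardize": True},     # cube root
    "res_size": {"transform": None, "standardize": True},           # standardize only
}


def transform_and_standardize(name, values, config):
    """Apply the configured transformation, then standardize if requested."""
    values = np.asarray(values, dtype=float)
    entry = config.get(name, {})  # features not in the dict are returned unchanged
    transform = entry.get("transform")
    if transform is not None:
        values = transform(values)
    if entry.get("standardize", False):
        std = values.std()
        if std > 0:  # avoid dividing by zero for constant features
            values = (values - values.mean()) / std
    return values


sasa = transform_and_standardize("sasa", [0.0, 4.0, 16.0, 36.0], features_transform)
one_hot = transform_and_standardize("res_type", [0.0, 1.0, 0.0], features_transform)
```

Note that `np.cbrt` is defined for negative inputs, which matters for a signed quantity such as electrostatic energy.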

Further steps:
Implement a user-defined dictionary, so that users can decide which transformation to apply to each feature and add it to the dictionary.

tests/test_dataset.py

gcroci2 · 2023-05-19T14:39:58Z

+            hdf5_path = hdf5_path,
+            target='binary',
+            node_features = [node_feat_test],
+            edge_features =[],


Suggested change

edge_features =[],

edge_features = None,

From the documentation of GraphDataSet, we cannot pass edge_features as None.
So I will keep edge_features =[].

From the docs, it is an Optional parameter, which means that it can be None :)
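
As a generic illustration of the point about Optional parameters, here is a minimal sketch. The function `select_edge_features` is hypothetical and is not part of GraphDataSet; how GraphDataSet itself interprets None is not shown in this thread, so the "None means no features" behavior below is only an assumption.

```python
from typing import List, Optional


def select_edge_features(edge_features: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: Optional[...] means the argument may be a list or None."""
    if edge_features is None:
        # Assumption for this sketch only: None is treated like an empty selection.
        return []
    return list(edge_features)


from_none = select_edge_features(None)
from_empty = select_edge_features([])
```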

gcroci2

Very useful and nice changes :) I left minor comments, please request my review again once you're done, we're almost there! Also, in general, please leave a space before and after the = symbol, and a space after punctuation such as , and :

We'll look at the PR together once you implement these changes and we'll finalize the following:

In test_standardize_graphdataset, we need to test hb_donors, for which standardize is False, and in general all the other features not listed in the dict. To test this, we can use _cal_mean_std with the hb_donors feature. In general, we need to verify two things. First, features in features_transform with standardize True should have mean and dev as expected (mean around 0 and dev around 1), and these should differ from the mean and dev of the same untouched features (maybe using _cal_mean_std). Second, features in features_transform with standardize False, or features not in the dict, should have mean and dev equal to those of the untouched features, again maybe using _cal_mean_std.
test_feature_transform_mean_std only partially tests the transformation, so we need to implement a smart way to really test transformations.
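
The two checks described above could look roughly like the following sketch. It uses synthetic NumPy arrays rather than the HDF5 test files, and `mean_std`, `is_standardized`, and `is_untouched` are hypothetical stand-ins (they are not `_cal_mean_std` or any other function from the package).

```python
import numpy as np


def mean_std(values):
    """Return (mean, standard deviation) of an array as plain floats."""
    return float(np.mean(values)), float(np.std(values))


def is_standardized(processed, atol=1e-6):
    """True if the processed values have mean ~0 and standard deviation ~1."""
    mean, std = mean_std(processed)
    return abs(mean) < atol and abs(std - 1.0) < atol


def is_untouched(raw, processed):
    """True if the processed values keep the mean and standard deviation of the raw ones."""
    return bool(np.allclose(mean_std(raw), mean_std(processed)))


rng = np.random.default_rng(0)

# Synthetic stand-in for a feature with a transformation and standardize True.
sasa_raw = rng.gamma(shape=2.0, scale=20.0, size=1000)
sasa_sqrt = np.sqrt(sasa_raw)
sasa_processed = (sasa_sqrt - sasa_sqrt.mean()) / sasa_sqrt.std()

# Synthetic stand-in for a feature with standardize False: returned as-is.
hb_donors_raw = rng.integers(0, 4, size=1000).astype(float)
hb_donors_processed = hb_donors_raw.copy()

standardized_ok = is_standardized(sasa_processed) and not is_untouched(sasa_raw, sasa_processed)
untouched_ok = is_untouched(hb_donors_raw, hb_donors_processed)
```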

joyceljy · 2023-05-22T09:34:14Z

I think not all the features with standardize True would get a mean around 0 and dev around 1, even after transformation. For example, for the feature sasa, the mean is 45 and the dev is 41.5 before transformation, and the mean is 5.7 and the dev is 3.5 after transformation.
Do you have any suggestions for setting the mean and dev range? For now I do something like this:

for key, values in features_dict.items():
    if key in features_transform:
        transform = features_transform.get(key, {}).get('transform')
        # Index 0: statistics returned by _cal_mean_std; index 1: statistics of the dataset values
        (mean, dev) = _cal_mean_std(hdf5_path, features_transform, key)
        means = [mean, values.mean()]
        devs = [dev, values.std()]
        if transform:  # test transformed features
            assert means[0] != means[1]
            assert devs[0] != devs[1]
            assert -10 < means[0] < 10
            assert -5 < devs[0] < 5
        else:  # test hb_donors: no transformation, so mean & std remain the same
            assert means[0] == means[1]
            assert devs[0] == devs[1]

gcroci2 · 2023-05-22T11:12:08Z

Another way to verify that you're actually standardizing is to do the inverse calculation (destandardizing the values) and verify that the mean and the std dev of these back-transformed values are the same as before.

In code, with values being the ones after transformation (if present) and standardization, this would be: values_no_std = values * dev + mean
Then you can test that values_no_std.mean() and values_no_std.std() are the same as the ones obtained with the feature not standardized.
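
A minimal sketch of this inverse check, using a synthetic array in place of a real feature (the square-root transformation and the gamma-distributed data are illustrative assumptions; `mean` and `dev` are the statistics used for standardization):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a raw feature; transformed but not yet standardized.
raw = rng.gamma(shape=2.0, scale=20.0, size=500)
transformed = np.sqrt(raw)

# Statistics used for standardization.
mean, dev = transformed.mean(), transformed.std()

# Values as they would come out of the dataset: transformed and standardized.
values = (transformed - mean) / dev

# Inverse calculation (destandardization).
values_no_std = values * dev + mean

# The back-transformed values should recover the non-standardized statistics.
assert np.isclose(values_no_std.mean(), mean)
assert np.isclose(values_no_std.std(), dev)
```

Comparing with `np.isclose` rather than `==` avoids spurious failures from floating-point rounding.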

Chia Yu Lin added 4 commits April 18, 2023 15:42

Modify load_one_graph

aebef28

Modify transformation method

d64d503

Modify transformation method

86a2678

Fix Indentation

f031cd5

joyceljy self-assigned this Apr 18, 2023

Chia Yu Lin and others added 12 commits April 19, 2023 11:07

Fix Indentation-isNot

36149d0

Apply transformation on GridDataSet

681dd18

Apply transformation on GridDataSet(modify features that has multiple channels)

eae6451

Remove _standardization

63368c8

Remove places that use _standardization

66154c4

Move feat_trans_dict to GraphDataSet

b72acae

Modify parameter

fba19e2

Modify test_graph_standardize_dictionary

a8c90c7

Fix bug in test_graph_standardize_dictionary

e3d90ee

Fix bug in test_graph_standardize_dictionary

4bf7505

Modify transform done in df_dict

438efdb

Modify params in get function & check situation for multiple channels

1c0fbc7

gcroci2 changed the title ~~feat: Impovement on Feature Standardization~~ feat: Improvement on Feature Standardization Apr 24, 2023

Giulia Crocioni and others added 12 commits May 9, 2023 15:24

fix standardization and add todos

ca272aa

Improve parameter namings and add all option in feature

e530b5c

Modify features_transform namings & remove standardize parameter in testing

85103ec

Fix linting error

1629165

Fix linting errors

d83ea2d

Fix package installation for pytorch_scatter

7cd2759

Fix linting

b93ba93

Fix linting and module error

eefbe6b

Fix all option bug and module error while testing

9da0636

Fix conflict and merge with main

1c67cc8

Fix action.yml and linting error

c15eeaa

Add dataset_type variable

3d64edc

gcroci2 reviewed May 19, 2023
