feat: use dictionary for flexibly transforming and standardizing features #418

Merged
gcroci2 merged 50 commits into main from Feat_Standardization_joyceljy
May 24, 2023

@joyceljy joyceljy commented Apr 18, 2023

This PR improves feature standardization by applying a suitable transformation to each feature.
For now, we modify the package by adding a dictionary that indicates which transformation should be applied to each feature. The details are as follows:

  • We won't normalize or standardize one-hot encoded features. For motivation, see for example this thread (this is the common opinion among the community).
  • In general, the rule of thumb for deciding whether to apply a transformation before standardization is to aim for a distribution that spreads out the feature values, ideally resembling a Gaussian, though not necessarily (for example, the cube-root versions of electrostatic and vanderwaals are better than the original ones).
  • Features to which we won't apply log, but standardization directly: res_size, res_charge, hb_donors, hb_acceptors, hse, irc_ features, res_mass, res_pI, distance
  • Features to which we'll apply log(x+1): res_depth, bsa, info_content
  • Features to which we'll apply square root: sasa
  • Features to which we'll apply cube root: electrostatic, vanderwaals
  • We'll remove pssm for now, since it's not correctly computed

Further steps:
Implement a user-defined dictionary, so that the user can decide which transformation to apply to each feature and add it to the dictionary.
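
As a minimal sketch of the dictionary described above: it assumes a layout in which each feature name maps to an optional `transform` callable and a `standardize` flag. The helper `apply_transform` and this exact layout are illustrative assumptions, not necessarily what the package implements.

```python
import numpy as np

# Hypothetical dictionary: feature name -> transformation and standardization flag.
features_transform = {
    "bsa": {"transform": lambda t: np.log(t + 1), "standardize": True},
    "sasa": {"transform": lambda t: np.sqrt(t), "standardize": True},
    "electrostatic": {"transform": lambda t: np.cbrt(t), "standardize": True},
    "res_size": {"transform": None, "standardize": True},
}


def apply_transform(name, values, config):
    """Transform and/or standardize one feature according to `config`.

    Features missing from `config` are returned unchanged.
    """
    values = np.asarray(values, dtype=float)
    entry = config.get(name, {})
    transform = entry.get("transform")
    if transform is not None:
        values = transform(values)
    if entry.get("standardize", False):
        values = (values - values.mean()) / values.std()
    return values


# Synthetic example values, not real data.
sasa = np.array([0.0, 4.0, 16.0, 36.0, 100.0])
sasa_std = apply_transform("sasa", sasa, features_transform)
untouched = apply_transform("not_in_dict", [1.0, 2.0, 3.0], features_transform)
```

With this layout, a user-defined dictionary only needs to follow the same two keys per feature; anything left out of the dictionary passes through untouched.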

@joyceljy joyceljy self-assigned this Apr 18, 2023
@gcroci2 gcroci2 changed the title feat: Impovement on Feature Standardization feat: Improvement on Feature Standardization Apr 24, 2023
hdf5_path = hdf5_path,
target='binary',
node_features = [node_feat_test],
edge_features =[],

Suggested change
- edge_features =[],
+ edge_features = None,

@joyceljy joyceljy May 19, 2023

According to the GraphDataSet documentation, we cannot pass edge_features as None.
So I will keep edge_features = [].

@gcroci2 gcroci2 May 22, 2023

From the docs, it is an Optional parameter, which means that it can be None :)

@gcroci2 gcroci2 left a comment

Very useful and nice changes :) I left minor comments; please request my review again once you're done, we're almost there! Also, in general, please leave a space before and after the = symbol, and a space after punctuation like , and :

We'll look at the PR together once you implement these changes and we'll finalize the following:

  • In test_standardize_graphdataset, we need to test hb_donors, for which standardize is False, and more generally all the other features not listed in the dict. To test this, we can use _cal_mean_std with the hb_donors feature. In general, we need to verify that dataset features that are in features_transform with standardize True have the expected mean and dev (mean around 0 and dev around 1), and that these differ from the mean and dev of the same features left untouched (maybe using _cal_mean_std). We also need to verify that dataset features that are in features_transform with standardize False, or that are not in the dict, have mean and dev equal to those of the untouched features, again maybe using _cal_mean_std.
  • test_feature_transform_mean_std only partially tests the transformation, so we need to implement a smart way to really test transformations
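
The checks listed above could look roughly like the sketch below. It uses synthetic NumPy data and a hypothetical `standardize_features` helper in place of the real dataset and `_cal_mean_std`, so it only illustrates the assertions, not the package's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for raw feature values (not real data).
raw = {
    "sasa": rng.uniform(0.0, 150.0, size=1000),
    "hb_donors": rng.integers(0, 4, size=1000).astype(float),
    "res_mass": rng.uniform(70.0, 200.0, size=1000),  # deliberately not in the dict
}

# Hypothetical test dictionary, mirroring the cases named in the review.
features_transform = {
    "sasa": {"transform": np.sqrt, "standardize": True},
    "hb_donors": {"transform": None, "standardize": False},
}


def standardize_features(raw, config):
    """Hypothetical helper: apply the optional transform, then standardize if asked."""
    out = {}
    for name, values in raw.items():
        entry = config.get(name, {})
        transform = entry.get("transform")
        new = transform(values) if transform is not None else values.copy()
        if entry.get("standardize", False):
            new = (new - new.mean()) / new.std()
        out[name] = new
    return out


processed = standardize_features(raw, features_transform)

for name, values in processed.items():
    if features_transform.get(name, {}).get("standardize", False):
        # Standardized features: mean around 0, dev around 1, different from raw.
        assert np.isclose(values.mean(), 0.0, atol=1e-8)
        assert np.isclose(values.std(), 1.0)
        assert not np.isclose(values.mean(), raw[name].mean())
    else:
        # standardize False or not in the dict: statistics are unchanged.
        assert np.isclose(values.mean(), raw[name].mean())
        assert np.isclose(values.std(), raw[name].std())
```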

@joyceljy joyceljy requested a review from gcroci2 May 20, 2023 12:02

joyceljy commented May 22, 2023

I think not all features with standardize True would get a mean around 0 and a dev around 1, even after transformation. For example, for the feature sasa, mean = 45 and dev = 41.5 before transformation, and mean = 5.7 and dev = 3.5 after transformation.
Do you have any suggestions for setting the mean and dev ranges? For now I do something like this:

for key, values in features_dict.items():
    if key in features_transform:
        transform = features_transform.get(key, {}).get('transform')
        # Index 0: mean and dev from _cal_mean_std; index 1: from the dataset values.
        (mean, dev) = _cal_mean_std(hdf5_path, features_transform, key)
        means = [mean, values.mean()]
        devs = [dev, values.std()]
        if transform:  # test transformed features
            assert means[0] != means[1]
            assert devs[0] != devs[1]
            assert -10 < means[0] < 10
            assert -5 < devs[0] < 5
        else:  # test hb_donors: no transformation, so mean and std remain the same
            assert means[0] == means[1]
            assert devs[0] == devs[1]

gcroci2 commented May 22, 2023

Another way to verify that you're actually standardizing is to do the inverse calculation (de-standardizing the values) and verify that the mean and the std dev of these back-transformed values are the same as before.

In terms of code, with values being the ones after transformation (if present) and standardization: values_no_std = values * dev + mean
Then you can test that values_no_std.mean() and values_no_std.std() are the same as the ones obtained with the feature not standardized.
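
A minimal NumPy sketch of this round-trip check, on synthetic values. Here `mean` and `dev` are assumed to be the statistics of the transformed but not yet standardized feature, and every name other than `values_no_std` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a feature such as sasa (not real data).
raw = rng.uniform(0.0, 150.0, size=500)

# Transformation first (square root here), then standardization.
transformed = np.sqrt(raw)
mean, dev = transformed.mean(), transformed.std()
values = (transformed - mean) / dev

# Inverse calculation: de-standardize the values.
values_no_std = values * dev + mean

# The back-transformed values recover the statistics of the
# feature that was transformed but not standardized.
assert np.isclose(values_no_std.mean(), mean)
assert np.isclose(values_no_std.std(), dev)
assert np.allclose(values_no_std, transformed)
```

This sidesteps the question of which mean and dev range to accept for each feature, because the comparison is against the feature's own pre-standardization statistics rather than fixed bounds.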

@gcroci2 gcroci2 merged commit 76e0345 into main May 24, 2023
@gcroci2 gcroci2 deleted the Feat_Standardization_joyceljy branch May 24, 2023 10:26
@gcroci2 gcroci2 changed the title feat: Improvement on Feature Standardization feat: use dictionary for flexibly transforming and standardizing features Jun 5, 2023