feat: use dictionary for flexibly transforming and standardizing features #418
tests/test_dataset.py
    hdf5_path = hdf5_path,
    target='binary',
    node_features = [node_feat_test],
    edge_features =[],
Suggested change:
-    edge_features =[],
+    edge_features = None,
According to the documentation of GraphDataSet, we cannot pass edge_features as None.
So I will keep edge_features =[].
From the docs, it is an Optional parameter, which means that it can be None :)
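As a small illustration of that point, here is a sketch using only the standard `typing` module. The function `build_dataset_args` is hypothetical and is not the GraphDataSet constructor; how the real class interprets None is not shown in this thread.

```python
from typing import List, Optional, get_args

# Optional[X] is shorthand for "X or None", so None is an accepted value.
EdgeFeatures = Optional[List[str]]

def build_dataset_args(edge_features: EdgeFeatures = None) -> dict:
    """Hypothetical helper: pass the argument through unchanged."""
    return {"edge_features": edge_features}

# NoneType is one of the two members of the Optional type.
none_is_allowed = type(None) in get_args(EdgeFeatures)
```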
gcroci2 left a comment
Very useful and nice changes :) I left minor comments; please request my review again once you're done, we're almost there! Also, in general, please leave a space before and after the = symbol, and a space after punctuation like , and :
We'll look at the PR together once you implement these changes and we'll finalize the following:
- In test_standardize_graphdataset, we need to test hb_donors, for which standardize is False, and in general all the rest of the features not indicated in the dict. To test it, we can use _cal_mean_std with the hb_donors feature. In general, we need to verify that features of dataset which are in features_transform with standardize True have mean and dev as indicated (mean around 0 and dev around 1) and a different mean and dev from the same features not touched (maybe using _cal_mean_std). We also need to verify that features of dataset which are in features_transform with standardize False, or which are not in the dict, have mean and dev equal to the ones not touched, again maybe using _cal_mean_std.
- test_feature_transform_mean_std only partially tests the transformation, so we need to implement a smart way to really test transformations.
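The checks described in the first point could be sketched roughly as follows. This is a standalone illustration on synthetic data using only the standard library: `cal_mean_std` is a stand-in for the PR's _cal_mean_std, `apply_features_transform` and the "transformation" key are assumptions, and the square root on sasa is chosen only for the example.

```python
import math
import random
import statistics

def cal_mean_std(values):
    """Stand-in for _cal_mean_std: population mean and standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_features_transform(features, features_transform):
    """Hypothetical helper: transform and/or standardize per dict entry."""
    processed = {}
    for name, values in features.items():
        spec = features_transform.get(name, {})
        transformation = spec.get("transformation")
        new_values = [transformation(v) for v in values] if transformation else list(values)
        if spec.get("standardize", False):
            mean, std = cal_mean_std(new_values)
            new_values = [(v - mean) / std for v in new_values]
        processed[name] = new_values
    return processed

rng = random.Random(0)
raw = {
    "sasa": [rng.uniform(0.0, 150.0) for _ in range(500)],
    "hb_donors": [float(rng.randint(0, 5)) for _ in range(500)],
    "res_mass": [rng.uniform(70.0, 200.0) for _ in range(500)],
}
features_transform = {
    "sasa": {"transformation": math.sqrt, "standardize": True},
    "hb_donors": {"transformation": None, "standardize": False},
    # res_mass is deliberately absent from the dict
}
processed = apply_features_transform(raw, features_transform)

for name in raw:
    mean, std = cal_mean_std(processed[name])
    if features_transform.get(name, {}).get("standardize", False):
        # Standardized features: mean around 0, dev around 1, and different from raw.
        assert abs(mean) < 1e-6 and abs(std - 1.0) < 1e-6
        assert (mean, std) != cal_mean_std(raw[name])
    else:
        # standardize False, or not in the dict: statistics must be untouched.
        assert (mean, std) == cal_mean_std(raw[name])
```

Looping over every feature name, rather than only those in the dict, is what covers the features that are not listed.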
I think not all the features with standardize True would get a mean around 0 and dev around 1 after transformation alone. For example, for the feature sasa, before transformation its mean=45 & dev=41.5, and after transformation its mean=5.7 & dev=3.5.
Another way to verify that you're actually standardizing is to do the inverse calculation (destandardizing the values) and verify that the mean and the std dev of these back-transformed values are the same as before. So in terms of code it would be,
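A minimal sketch of such a round-trip check, not the reviewer's original snippet. It uses synthetic values and only the standard library, and the square root is an assumed transformation. In a real test, the mean and std dev used for the inverse step would presumably come from whatever the dataset stores, which is also an assumption here.

```python
import math
import random
import statistics

rng = random.Random(42)
sasa = [rng.uniform(0.0, 150.0) for _ in range(1000)]  # synthetic stand-in values

# Step 1: transformation only (square root assumed for illustration).
transformed = [math.sqrt(v) for v in sasa]
mean_t = statistics.fmean(transformed)
std_t = statistics.pstdev(transformed)

# Step 2: standardization of the transformed values.
standardized = [(v - mean_t) / std_t for v in transformed]

# Inverse calculation: destandardize with the same mean and std dev.
destandardized = [z * std_t + mean_t for z in standardized]

# The transformation alone does not give mean 0 and dev 1 ...
assert mean_t > 1.0
# ... the standardization step does ...
assert abs(statistics.fmean(standardized)) < 1e-9
assert math.isclose(statistics.pstdev(standardized), 1.0, rel_tol=1e-9)
# ... and the back-transformed values recover the pre-standardization statistics.
assert math.isclose(statistics.fmean(destandardized), mean_t, rel_tol=1e-9)
assert math.isclose(statistics.pstdev(destandardized), std_t, rel_tol=1e-9)
```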
Improvement on feature standardization by applying suitable transformation for different features.
For now, we will modify the package by adding a dictionary indicating the type of transformation that should be applied to each feature. The details are as follows:
electrostatic and vanderwaals cube version is better than the original one).
- res_size, res_charge, hb_donors, hb_acceptors, hse, irc_features, res_mass, res_pI, distance
- res_depth, bsa, info_content
- sasa
- electrostatic, vanderwaals
- pssm since it's not correctly computed
Further steps:
Implement a user-defined dictionary, so that the user can decide which transformation to apply to each feature and insert it into the dictionary.
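One way this further step might look, as a sketch under stated assumptions: the key names "transformation" and "standardize", the helper `merge_features_transform`, and the default entry shown are all hypothetical and only illustrate merging user choices over package defaults.

```python
import math

ALLOWED_KEYS = {"transformation", "standardize"}

# Hypothetical default; the mapping of sasa to a square root is illustrative only.
DEFAULT_FEATURES_TRANSFORM = {
    "sasa": {"transformation": math.sqrt, "standardize": True},
}

def merge_features_transform(user_dict=None, defaults=DEFAULT_FEATURES_TRANSFORM):
    """Return the defaults with user-defined entries layered on top."""
    merged = {name: dict(spec) for name, spec in defaults.items()}  # copy, don't mutate
    for name, spec in (user_dict or {}).items():
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"Unknown keys for feature {name!r}: {sorted(unknown)}")
        transformation = spec.get("transformation")
        if transformation is not None and not callable(transformation):
            raise TypeError(f"Transformation for feature {name!r} must be callable or None")
        base = merged.setdefault(name, {"transformation": None, "standardize": False})
        base.update(spec)
    return merged

# Usage: add a new entry for bsa and switch off standardization for sasa.
user_choice = {
    "bsa": {"transformation": lambda v: math.log(v + 1), "standardize": True},
    "sasa": {"standardize": False},
}
merged = merge_features_transform(user_choice)
```

Validating the keys up front means a typo such as "standardise" fails loudly instead of silently falling back to a default.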