Rank normalization of features #1062
Description
I was talking about this feature with @ogrisel and he asked me to place an issue for it.
This is a technique suggested by Yoshua Bengio to handle features with unknown scale:
Convert the features to rank scale, so the lowest rank is 0 and the highest rank is 1. This is superior to a z-transform (zero mean, unit variance) because a single huge outlier in a feature can distort the mean and variance and mess things up, whereas a rank transform is robust.
You can see a description here of Python code to do this:
http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
scipy.stats.rankdata does it.
These approaches convert to ranks, but don't normalize by the number of samples.
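As a rough sketch of what the normalization step could look like on top of scipy.stats.rankdata: the helper name `rank_normalize` is made up for illustration, and dividing by `n - 1` (so the ends land exactly on 0 and 1) is just one possible convention, not something the linked answers prescribe.

```python
import numpy as np
from scipy.stats import rankdata


def rank_normalize(x):
    """Map a 1-D feature column to normalized ranks in [0, 1].

    Hypothetical helper, not part of scipy or scikit-learn.
    """
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        # A single value has no meaningful rank spread; return zeros.
        return np.zeros_like(x)
    # rankdata returns ranks in 1..n; ties share their average rank.
    ranks = rankdata(x)
    # Shift to start at 0 and divide by n - 1 so the largest rank maps to 1.
    return (ranks - 1) / (x.size - 1)


# The outlier (1e6) only takes the top rank; it does not squash the other values.
normalized = rank_normalize([10.0, 1e6, 20.0, 30.0])
```

Note that with the default average-rank tie handling, tied minimum values do not map to 0, which matters for the sparsity point below.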
One thing to be careful of:
If you have a lot of zeros and do a rank transform using scipy.stats.rankdata and then normalize by the number of samples, the zeros will end up with rank > 0, so you lose sparsity. To preserve sparsity, I would recommend scaling the range of the ranks to [0, 1], so that the lowest rank maps to exactly 0, and clipping any value from the test set that falls outside the range.
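Here is a minimal sketch of how that could work, under some assumptions of mine: the feature is non-negative so that zero is the smallest training value, tied values get the "min" rank (the count of strictly smaller samples), and unseen test values are placed by linear interpolation between neighbouring training values. The function names `fit_rank_scale` and `transform_rank_scale` are hypothetical, and interpolation is my choice rather than part of the proposal.

```python
import numpy as np


def fit_rank_scale(train_col):
    """Learn a value -> scaled-rank mapping from one training column."""
    train_col = np.asarray(train_col, dtype=float)
    values, counts = np.unique(train_col, return_counts=True)
    # "min" rank of each distinct value: how many samples are strictly smaller.
    min_rank = np.cumsum(counts) - counts
    top = min_rank[-1]
    if top > 0:
        scaled = min_rank / top  # smallest value -> 0.0, largest -> 1.0
    else:
        scaled = np.zeros(len(values))  # constant column
    return values, scaled


def transform_rank_scale(col, values, scaled):
    """Apply the learned mapping; out-of-range values are clipped."""
    col = np.asarray(col, dtype=float)
    # np.interp returns the end values of `scaled` outside the training range,
    # which clips test values to [0, 1].
    return np.interp(col, values, scaled)


train = [0.0, 0.0, 0.0, 5.0, 2.0, 9.0]
values, scaled = fit_rank_scale(train)
train_ranked = transform_rank_scale(train, values, scaled)  # zeros stay exactly 0
test_ranked = transform_rank_scale([0.0, 100.0, 3.5], values, scaled)
```

Because the zeros map to exactly 0, the transform could in principle be applied only to the stored non-zero entries of a sparse matrix, though this sketch works on dense columns.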