Rank normalization of features

I was talking about this feature with @ogrisel and he asked me to open an issue for it.

This is a technique suggested by Yoshua Bengio to handle features with unknown scale:

Convert each feature to a rank scale, so that the lowest rank maps to 0 and the highest rank maps to 1. This is superior to the z-transform (zero mean, unit variance) because a single huge outlier value can distort the mean and variance and mess things up, whereas a rank transform is robust to it.
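
To illustrate the robustness claim, here is a minimal sketch comparing the two transforms on a single feature with one outlier. The helper names `zscore` and `rank_scale` are made up for this example, and `rank_scale` assumes at least two values and breaks ties by position:

```python
import numpy as np

def zscore(x):
    """Standard z-transform: zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def rank_scale(x):
    """Map values to [0, 1] by rank (hypothetical helper, ties broken by position)."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))  # 0-based ranks: 0 .. n-1
    return ranks / (len(x) - 1)

values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one huge outlier
z = zscore(values)
r = rank_scale(values)

print(z)  # the four ordinary values are squeezed very close together
print(r)  # evenly spread over [0, 1] regardless of the outlier's size
```

With the z-transform the outlier dominates the mean and standard deviation, so the ordinary values become nearly indistinguishable. The rank-scaled values depend only on ordering.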

Python code to compute the ranks is described here:
http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
scipy.stats.rankdata does it too.
Both convert values to ranks, but neither normalizes by the number of ranked values.
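
As a rough sketch of the missing normalization step: `scipy.stats.rankdata` returns 1-based ranks (ties get the average rank by default), so shifting and dividing by `n - 1` maps them onto [0, 1]. The variable names are illustrative only, and the input is assumed to have at least two values:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([10.0, 30.0, 20.0, 20.0])

ranks = rankdata(x)                      # 1-based ranks, ties averaged
scaled = (ranks - 1) / (len(x) - 1)      # lowest rank -> 0, highest rank -> 1

print(ranks)
print(scaled)
```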

One thing to be careful of:
If a feature has a lot of zeros and you rank-transform it with scipy.stats.rankdata and then normalize by the number of ranked values, the zeros will end up having rank > 0, so you lose sparsity. To preserve sparsity, I would recommend scaling the range of the ranks to [0, 1] and clipping any feature value from the test set that falls outside that range.
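
One possible way to sketch this, under several assumptions of mine rather than anything settled in this issue: the feature is non-negative with zero as its smallest training value, tied values take the minimum rank (so all zeros map to exactly 0), and unseen test values are linearly interpolated between neighbouring training values. `fit_rank_scaler` and `rank_transform` are hypothetical names, and the training column needs at least two distinct values:

```python
import numpy as np
from scipy.stats import rankdata

def fit_rank_scaler(train_col):
    """Learn a value -> scaled-rank mapping from one training feature column."""
    train_col = np.asarray(train_col, dtype=float)
    ranks = rankdata(train_col, method="min") - 1      # 0-based, ties share the lowest rank
    values, first_idx = np.unique(train_col, return_index=True)
    scaled = ranks[first_idx] / ranks.max()            # smallest value -> 0, largest -> 1
    return values, scaled

def rank_transform(col, values, scaled):
    """Apply the mapping; np.interp clips values outside the training range."""
    return np.interp(np.asarray(col, dtype=float), values, scaled)

train = np.array([0.0, 0.0, 0.0, 5.0, 2.0, 0.0, 9.0])

# Naive approach: average ranks divided by n turn every zero into a nonzero.
naive = rankdata(train) / len(train)

values, scaled = fit_rank_scaler(train)
train_out = rank_transform(train, values, scaled)

# Test values beyond the training range are clipped to 0 or 1.
test = np.array([0.0, 9.0, 100.0, -3.0, 3.5])
test_out = rank_transform(test, values, scaled)

print(naive)
print(train_out)
print(test_out)
```

Because the zeros stay exactly zero, the transform can in principle be applied only to the stored nonzero entries of a sparse matrix. That only holds under the non-negativity assumption above.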


Rank normalization of features #1062
