Rank normalization of features #1062

@turian

I was talking about this feature with @ogrisel and he asked me to place an issue for it.

This is a technique suggested by Yoshua Bengio to handle features with unknown scale:

Convert each feature to a rank scale, normalized so that the lowest rank maps to 0 and the highest rank maps to 1. This is superior to the z-transform (zero mean, unit variance) because a single huge outlier value can distort the mean and variance and mess up the scaling of everything else. A rank transform is robust to that.

Python code to do this is described here:
http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
scipy.stats.rankdata does it as well.
Both convert to ranks, but neither normalizes by the number of values being ranked.
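
To make the proposal concrete, here is a minimal sketch of the normalization step on top of scipy.stats.rankdata. The helper name `rank_normalize` is hypothetical, and ranking each column independently across samples is my assumption about what is wanted. It also computes the z-transform of the same data for comparison.

```python
import numpy as np
from scipy.stats import rankdata


def rank_normalize(X):
    """Map each column of X to [0, 1] by rank (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    if n_samples == 1:
        # A single sample has no meaningful rank spread.
        return np.zeros_like(X)
    # rankdata returns 1-based ranks; ties get the average rank by default.
    ranks = np.apply_along_axis(rankdata, 0, X)
    # Shift to 0-based and divide by the largest possible rank difference.
    return (ranks - 1.0) / (n_samples - 1.0)


# Second column contains one huge outlier.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 1e6]])

X_ranked = rank_normalize(X)
# z-transform for comparison: the outlier dominates mean and variance.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
```

In this toy example the ranked outlier column is evenly spaced between 0 and 1, while the z-transformed values of the three non-outlier rows are almost indistinguishable from each other.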

One thing to be careful of:
If you have a lot of zeros and do a rank transform using scipy.stats.rankdata, then normalize by the number of values, the zeros will end up with rank > 0, so you lose sparsity. To preserve sparsity, I would recommend rescaling the ranks to the range [0, 1], so that the lowest rank (the zeros, when zero is the smallest value) maps back to 0, and clipping any feature value from the test set that falls outside that range.
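
One possible reading of the sparsity point, sketched below for a single feature column. The assumptions are mine: the feature is nonnegative, zeros stay exactly 0, and only the nonzero training values are ranked. The function names `fit_rank_scale` and `transform_rank_scale` are hypothetical. Test values outside the training range are clipped to the nearest valid rank.

```python
import numpy as np


def fit_rank_scale(column):
    """Return the sorted nonzero training values of one nonnegative feature."""
    column = np.asarray(column, dtype=float)
    return np.sort(column[column != 0])


def transform_rank_scale(column, sorted_nonzero):
    """Map zeros to 0 and nonzero values to (0, 1] by training-set rank."""
    column = np.asarray(column, dtype=float)
    out = np.zeros_like(column)
    n = sorted_nonzero.size
    if n == 0:
        # No nonzero training values were seen; everything stays 0.
        return out
    nonzero = column != 0
    # Count of nonzero training values <= each value (a 1-based rank).
    ranks = np.searchsorted(sorted_nonzero, column[nonzero], side="right")
    # Clip: values below the training minimum get rank 1; searchsorted
    # already caps values above the training maximum at rank n.
    ranks = np.clip(ranks, 1, n)
    out[nonzero] = ranks / n
    return out


train = np.array([0.0, 0.0, 5.0, 0.0, 2.0, 100.0])
test = np.array([0.0, 1.0, 7.0, 1000.0])

sorted_nonzero = fit_rank_scale(train)
train_ranked = transform_rank_scale(train, sorted_nonzero)
test_ranked = transform_rank_scale(test, sorted_nonzero)
```

With this scheme every zero in the input is still a zero in the output, so a sparse matrix would keep its sparsity pattern, and the test value 1000 lands on 1 instead of exceeding the range.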
