Preprocessing simplification #193
Conversation
Cool, the code looks simpler than before. We may want to keep the name…
What was your motivation for putting the copy argument in the constructor? The rest of the scikit usually puts it in transform. Also, copy need not be grid-searched.
I thought it was more logical to have it as a constructor parameter, since you do not need to know the dimensionality of the data to set it. I could re-add it, though.
The most common column normalization scheme I see is unit-variance normalization, which will be addressed in the…
I would expect an axis parameter. One use would be in Sparse PCA, where U is constrained to have normalized columns, and when transforming data by projecting onto V, I had to do column normalization by hand. OTOH this is not really preprocessing.
@vene I think I should expose lower-level functions for your use case instead of using the…
Agreed, it's sort of "internal use".
First of all, thanks for working on this - IMHO that's an important issue. Regarding the…
I think the latter is popular among neural net folks [1]. Regarding the best…

[1] http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html
Alright, we agree.
I think the neural network folks (LeCun, Bottou, Hinton...) recommend scaling to unit variance (or even whitening the data using…). We could rename… We could also offer a…
I agree, I will factor out a utility function called…
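Such a utility function (the thread later refers to functions named scale and normalize) could look roughly like the sketch below. This is only an illustration of the idea being discussed, with an assumed signature, not the actual scikit-learn implementation:

```python
import numpy as np

def scale(X, axis=0, with_centering=True, copy=True):
    """Center to zero mean and scale to unit variance along the given axis.

    Sketch only: the real function name, signature, and defaults are
    assumptions based on the discussion above.
    """
    X = np.asarray(X, dtype=np.float64)
    if copy:
        X = X.copy()
    if with_centering:
        X -= X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    std[std == 0.0] = 1.0  # avoid division by zero on constant features
    X /= std
    return X
```

With axis=0 each feature (column) ends up with zero mean and unit variance, which matches the neural-net recommendation discussed above.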
I am +1, this is exactly what I had in mind.
I'm rather +1 for keeping Scaler (I don't find Standardizer more…).
If you create such a utility function, I don't see the harm in adding…
Would applying the centering only on non-zero features work? Mathieu
Do you have any use case for column normalization in a pipeline setting? I don't see any really. The API should emphasize good / common practices IMHO.
That sounds wrong to me. Do you know a case where this was proven useful?
OK, so let's postpone until someone feels the need for it, then. Regarding the name SampleNormalizer, I actually find it confusing.
I'm just checking with you, I've never done it. I agree it can… Mathieu
Alright for the name of the…
As for the centering of data: indeed, the default won't be the same for sparse and dense. I had in mind to set the default to centering=True in the constructor but raise an exception if the data is sparse and centering is True. That means users will have to disable centering explicitly if they have sparse data, but I think it's better to be explicit rather than trying to implement "smart" / magic behavior.
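The explicit-failure behavior described above could be sketched as follows. Class and parameter names follow the discussion; everything else is assumed, not the actual implementation:

```python
import numpy as np
from scipy import sparse

class Scaler:
    """Sketch: centering defaults to True, but sparse input combined with
    centering=True raises instead of silently densifying the data."""

    def __init__(self, centering=True, copy=True):
        self.centering = centering
        self.copy = copy

    def fit(self, X):
        if sparse.issparse(X):
            if self.centering:
                raise ValueError(
                    "Cannot center sparse data without densifying it; "
                    "pass centering=False explicitly for sparse input.")
            # a real implementation would compute per-feature scale
            # statistics here without densifying the matrix
        else:
            X = np.asarray(X, dtype=np.float64)
            self.mean_ = X.mean(axis=0) if self.centering else None
            self.std_ = X.std(axis=0)
        return self
```

The point of raising rather than silently skipping the centering is that the user sees exactly which preprocessing was (not) applied.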
+1
Once you have added the additional tests you were thinking about and the narrative doc, I think you can merge and keep the new features for another pull request.
Alright. I'll try to finish that work tomorrow.
Requires ogrisel's preprocessing-simplification branch, pull request scikit-learn#193.
Alright, I think this branch is now ready to be merged: the sparse Scaler and binning will be implemented in future branches.
Huge job on the documentation, Olivier! I'll try to write the missing parts (LabelBinarizer and KernelCenterer) next week. If it's OK that I do it in master, +1 for merge.
OK for writing the missing doc directly on master. I think Alex will have a look at the branch tomorrow. I'll let you guys merge it yourselves when you agree (I might be offline all day tomorrow).
Here you say you normalize rows, but then you set axis=1 by default, and later you say: "if 1, normalize the columns instead of the rows".
I am not sure why axis is required for normalize.
You normalize the samples (rows) over the features (columns, hence axis=1 in numpy parlance).
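The axis convention being debated can be shown with a small numpy sketch (illustration only; the array values are arbitrary):

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [5.0, 12.0]])

# axis=1 (the default discussed here): divide each sample (row)
# by its own L2 norm, giving unit-norm rows.
unit_rows = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# axis=0: divide each feature (column) by its norm instead,
# giving unit-norm columns.
unit_cols = X / np.sqrt((X ** 2).sum(axis=0, keepdims=True))
```

So "normalize the samples over the features" means summing squares along axis=1, the feature axis, in numpy parlance.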
Although I understand and like the idea behind the two objects (Scaler for features and Normalizer for samples), I have a few remarks:
Normalize is scaling using the norms of the vectors, while Scaler uses the variances. You can use a pipeline outside of a grid search. I agree it should not be put inside a grid search unless you want to measure the impact of…
(BTW, normalizing with the L2 norm is in my experience what works best for text classification.)
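The distinction between the two transformers can be illustrated directly (a sketch in plain numpy; the values are arbitrary):

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# Normalizer-style: each sample (row) is divided by its own L2 norm;
# both rows here collapse onto the same unit vector, since one is a
# scaled copy of the other.
X_normalized = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# Scaler-style: each feature (column) is centered and divided by its
# standard deviation, a per-column operation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

This is why sample normalization (not column normalization) is the natural fit for text classification: it removes document-length effects while leaving the direction of each sample vector intact.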
+1, my point was on the functions scale and normalize. Normalizer does not expose axis.
I agree. My point is that normalizing samples in a pipeline can cause a lot of duplicated computation if you do not pay attention.
This is a general issue with any feature extractor / transformer: if you use PCA, ICA or NMF as the input of a classifier, the issue will be even worse. I think we could extend the pipeline to make a variant that uses joblib to memoize pipelined transformers, but we need to do it carefully to avoid eating all the available space on the hard drive.
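The memoization idea could be sketched with joblib's Memory cache. This is only an illustration of the approach, not an actual Pipeline feature; the cached function and cache location are assumptions:

```python
import tempfile
import numpy as np
from joblib import Memory

# On-disk cache for transformer outputs. As noted above, a real
# integration would have to bound or clean this directory so the
# cache does not eat all the available disk space.
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def cached_transform(X):
    # Stand-in for an expensive pipelined transformer step (e.g. a PCA
    # fit_transform). joblib hashes the input array, so repeated calls
    # with identical data are served from the on-disk cache.
    return X - X.mean(axis=0)
```

During a grid search, identical (transformer, data) combinations would then be computed once instead of once per parameter setting of the downstream estimator.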
Here is some early progress report on a refactoring of the preprocessing package to make it simpler by combining the dense and sparse variants into consistent classes.
As usual, early feedback is welcome.
TODO before merge:
- more tests for pathological cases
- narrative documentation and more usage examples

The sparse variant for the Scaler is left for another pull request.