
[MRG] ENH Add example comparing the distribution of all scaling preprocessor #2

Merged
glemaitre merged 10 commits into glemaitre:quantile_scaler from raghavrv:quantile_scaler_plot
Feb 22, 2017

@raghavrv
Collaborator

raghavrv commented Feb 16, 2017

Main PR: scikit-learn#8363

I have some initial plots comparing different scalers on the California housing dataset...
I binarized the targets to make them easier to visualize...

For the quantile normalizer:

image

image

@glemaitre @tguillemot @ogrisel @dengemann @jnothman

@glemaitre
Owner

@raghavrv can you plot the robust scaler?

@dengemann

That looks pretty cool! Indeed it seems to support the point made by @ogrisel about the potentially decorrelating nature of this non-linear transform.

@ogrisel

ogrisel commented Feb 16, 2017

I don't understand why the normalized data looks discretized on the y axis. Aren't we linearly interpolating points between quantiles?

EDIT by @raghavrv (This was for an old deleted plot)
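
For reference, the interpolation asked about above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code under review: the helper names `percentile`, `interp` and `quantile_transform` are hypothetical, and the sketch assumes a uniform output distribution on [0, 1].

```python
def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of sorted data."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def interp(x, xp, fp):
    """Piecewise-linear interpolation over non-decreasing xp, clipped at both ends."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    for i in range(1, len(xp)):
        if x <= xp[i]:
            # Here xp[i - 1] < x <= xp[i], so the denominator is non-zero.
            t = (x - xp[i - 1]) / (xp[i] - xp[i - 1])
            return fp[i - 1] + t * (fp[i] - fp[i - 1])


def quantile_transform(train, values, n_quantiles=5):
    """Map values to [0, 1] by interpolating between quantiles of `train`."""
    ordered = sorted(train)
    references = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    quantiles = [percentile(ordered, 100 * r) for r in references]
    return [interp(v, quantiles, references) for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]
transformed = quantile_transform(data, data)    # [0.0, 0.25, 0.5, 0.75, 1.0]
in_between = quantile_transform(data, [2.5])    # [0.375]
```

A value that falls between two learned quantiles lands between the corresponding reference levels, so the output of such a transform is continuous rather than discretized.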

@glemaitre
Owner

@ogrisel I checked with @raghavrv IRL: it was a mistake.

@raghavrv
Collaborator Author

raghavrv commented Feb 16, 2017

@ogrisel Sorry, I plotted the wrong features... I have updated the PR description with the plots for all the scalers / normalizers. You can see our QN performing great!

@raghavrv
Collaborator Author

Note that I didn't binarize it this time and instead used matplotlib's colormap...

@ogrisel

ogrisel commented Feb 16, 2017

The plots look great!

I find it weird that robust scaler is yielding such a flat profile. How many "outliers" are there on the y axis? I would have assumed that IQR scaling should have performed better.

@ogrisel

ogrisel commented Feb 16, 2017

Please add x and y labels on the first 2 plots (data without transform and after zoom on non-outliers).

@raghavrv
Collaborator Author

> How many "outliers" are there on the y axis? I would have assumed that IQR scaling should have performed better.

It scales the outliers by the IQR, so the outliers will still be outliers in the scaled data and hence the flat profile... We could zoom in like it is done here, but it would be unfair to the other scalers I think....
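
The effect described above can be reproduced with a small plain-Python sketch. It assumes a robust scaler that centers on the median and divides by the interquartile range; the helpers `percentile` and `robust_scale` are hypothetical names written for this illustration, not the library implementation.

```python
def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of sorted data."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    ordered = sorted(values)
    median = percentile(ordered, 50)
    iqr = percentile(ordered, 75) - percentile(ordered, 25)
    return [(v - median) / iqr for v in values]


# Eight ordinary points and one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]
scaled = robust_scale(values)

inliers = scaled[:-1]   # spans -1.0 .. 0.75
outlier = scaled[-1]    # 248.75
```

The inliers end up in a narrow band around zero while the outlier stays hundreds of IQRs away, so an unzoomed plot of the scaled data still looks flat.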

@jnothman

jnothman commented Feb 16, 2017 via email

@jnothman

jnothman commented Feb 16, 2017 via email

@ogrisel

ogrisel commented Feb 20, 2017

> or show 0-1 and mark outliers along edges with x's

I am not sure how showing the outliers would work, but I like the idea of zooming on the [0.01-0.99] quantiles for all the transforms (in addition to the original, unzoomed plot for each transform).
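
A minimal sketch of such a zoom, assuming the cut is made at the 1st and 99th percentiles of a single feature. The helper names `percentile` and `non_outlier_mask` are hypothetical and written for this illustration only.

```python
def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of sorted data."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def non_outlier_mask(values, low=1.0, high=99.0):
    """True for values inside the [low, high] percentile range."""
    ordered = sorted(values)
    lower = percentile(ordered, low)
    upper = percentile(ordered, high)
    return [lower <= v <= upper for v in values]


# 100 regular points plus one extreme value.
values = [float(v) for v in range(1, 101)] + [10000.0]
mask = non_outlier_mask(values)
zoomed = [v for v, keep in zip(values, mask) if keep]
```

Applying the same mask to every transformed version of the data keeps the zoomed panels comparable across scalers.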

@raghavrv
Collaborator Author

raghavrv commented Feb 20, 2017

@ogrisel @jnothman Okay this is the re-updated plot :)

image

@ogrisel

ogrisel commented Feb 20, 2017

Very nice!

@ogrisel

ogrisel commented Feb 20, 2017

Maybe you could remove the color bar from the first plots and only display it for the last plot as they all use the same color scale.

@dengemann

Great work! It looks very helpful and clear.

@tguillemot
Collaborator

Nice work @raghavrv !

axes = subplots[offset + 2: offset + 4] + subplots[offset + 6: offset + 8]
plot_distribution(axes, X[non_outliers], y[non_outliers], hist_nbins=50,
                  plot_title=(title +
                              "\n(Zoomed-in at quantile range [0, 99))"),

quantile -> percentile.

Also, I think parentheses enclosing these subtitles can be removed.

@jnothman

Great stuff!

@@ -0,0 +1,133 @@
#!/usr/bin/python
Owner

`#!/usr/bin/env python` would be better.

outliers that can make visualization of the data difficult.

Also linear models like :class:`sklearn.linear_model.SVM` require data which is
normalized to the range [-1, 1].

approximately normalized to the [-1, 1] or [0, 1] range, or at the very least have all the features on the same scale.
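
As a small illustration of the ranges mentioned in this suggestion, here is a plain-Python sketch of min-max scaling. The function name `min_max_scale` and its `feature_range` argument are hypothetical, and the sketch assumes the feature is not constant.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so that min -> range start and max -> range end."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    # Assumes hi != lo, i.e. the feature is not constant.
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


values = [2.0, 4.0, 6.0, 10.0]
unit = min_max_scale(values)                     # [0.0, 0.25, 0.5, 1.0]
symmetric = min_max_scale(values, (-1.0, 1.0))   # [-1.0, -0.5, 0.0, 1.0]
```

Because the mapping is driven by the minimum and maximum, a single extreme value compresses all other points into a small part of the target range.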

X_full, y_full = dataset.data, dataset.target

# Take only 2 features to make visualization easier
# Feature 0 has a tapering distribution of outliers

ogrisel commented Feb 21, 2017

Feature 0 has a long-tailed distribution. Feature 5 has a few, but very large, outliers.

# Blank space to avoid overlapping of plots;
# plt.tight_layout does not work with gridspec, height of this row is
# adjusted at `height_ratios` param given to `GridSpec`
plt.subplot(gs[offset+12: offset+16]).axis('off')

ogrisel commented Feb 21, 2017

cosmetics: gs[offset + 12:offset + 16] or gs[offset+12:offset+16] but no spacing around the slicing operator : and especially not asymmetric, otherwise it looks like the literal notation of a dict.
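
The spacing variants discussed here are equivalent to the interpreter and differ only in readability. A tiny sketch with an ordinary list standing in for the grid object (the names `cells` and `offset` are illustrative only):

```python
cells = list(range(32))
offset = 4

spaced = cells[offset + 12:offset + 16]      # operators spaced, colon tight
compact = cells[offset+12:offset+16]         # everything tight
asymmetric = cells[offset+12: offset+16]     # legal, but reads like `key: value`
```

All three expressions select the same four elements; the comment only concerns which form is easiest to read.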

raghavrv force-pushed the quantile_scaler_plot branch from 1f8538a to 8272ec1 on February 21, 2017 16:11
@raghavrv
Collaborator Author

raghavrv commented Feb 21, 2017

Thanks all for the comments :)

Any idea on how the doc can be improved? Should it have a note for each scaler/norm. stating its use case?

(Have updated the above comment with the new plot)

raghavrv changed the title from "[WIP] ENH Add example comparing the distribution of all scaling preprocessor" to "[MRG] ENH Add example comparing the distribution of all scaling preprocessor" on Feb 21, 2017
@ogrisel

ogrisel commented Feb 21, 2017

> Any idea on how the doc can be improved? Should it have a note for each scaler/norm. stating its use case?

Yes, I think it would be useful to give a high-level analysis of the behavior of each w.r.t. this specific 2D dataset.

@jnothman

I'm now wondering if there's any value in showing the transformations on held-out data

@ogrisel

ogrisel commented Feb 21, 2017

> I'm now wondering if there's any value in showing the transformations on held-out data

They should look very similar to the data on the training set, right? Adding plots for held out data would render this example quite complex. I am not sure it's worth the additional cognitive load.

@jnothman

jnothman commented Feb 21, 2017 via email

glemaitre merged commit 88578df into glemaitre:quantile_scaler on Feb 22, 2017
raghavrv deleted the quantile_scaler_plot branch on February 22, 2017 12:53
@glemaitre
Owner

Merged!!!

@raghavrv
Collaborator Author

Thanks for merging! 🎉 I'll edit the docs in the main branch...

glemaitre pushed a commit that referenced this pull request Apr 7, 2017
…ocessor (#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang
glemaitre pushed a commit that referenced this pull request Apr 8, 2017
…ocessor (#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang
glemaitre added a commit that referenced this pull request Jun 10, 2017
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last print

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restraint to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMET change variable names

* FIX change output_pdf occurrence to output_distribution

* FIX partial fixes from comments

* COSMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/ENH Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succinct

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* ENH improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix couples of spelling

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise to be a boolean and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Simplifies also the code, examples, and documentation
glemaitre pushed a commit that referenced this pull request Jun 27, 2017
* add test for _preprocess_data and make it consistent

* fix pep8

* add doc, cast systematically y in X.dtype and update test_coordinate_descent.py

* test if input values don't change with copy=True

* test if input values don't change with copy=True #2

* fix doc

* fix doc #2

* fix doc #3
GaelVaroquaux pushed a commit that referenced this pull request Jul 17, 2018
…y calculation (scikit-learn#11464)

* Fix to allow M

* Updated MAE test to consider sample_weights in calculation

* Removed comment

* Fixed: E501 line too long (82 > 79 characters)

* syntax correction

* Added fix details

* Changed to use consistent datatypes during calculations

* Corrected formatting

* Requested Changes

* removed explicit casts

* Removed unnecessary explicits

* Removed unnecessary explicit casts

* added additional test

* updated comments

* Requested changes incl additional unit test

* fix mistake

* formatting

* removed whitespace

* added test notes

* formatting

* Requested changes

* Trailing space fix attempt

* Trailing whitespace fix attempt #2

* Remove trailing whitespace
GaelVaroquaux pushed a commit that referenced this pull request Jul 17, 2018
* Add averaging option to AMI and NMI

Leave current behavior unchanged

* Flake8 fixes

* Incorporate tests of means for AMI and NMI

* Add note about `average_method` in NMI

* Update docs from AMI, NMI changes (#1)

* Correct the NMI and AMI descriptions in docs

* Update docstrings due to averaging changes

- V-measure
- Homogeneity
- Completeness
- NMI
- AMI

* Update documentation and remove nose tests (#2)

* Update v0.20.rst

* Update test_supervised.py

* Update clustering.rst

* Fix multiple spaces after operator

* Rename all arguments

* No more arbitrary values!

* Improve handling of floating-point imprecision

* Clearly state when the change occurs

* Update AMI/NMI docs

* Update v0.20.rst

* Catch FutureWarnings in AMI and NMI
glemaitre pushed a commit that referenced this pull request Jul 20, 2018
initial PR commit

seq_dataset.pyx generated from template

seq_dataset.pyx generated from template #2

rename variables

fused types consistency test for seq_dataset

a

sklearn/utils/tests/test_seq_dataset.py

new if statement

add doc

sklearn/utils/seq_dataset.pyx.tp

minor changes

minor changes

typo fix

check numeric accuracy only up 5th decimal

Address oliver's request for changing test name

add test for make_dataset and rename a variable in test_seq_dataset
glemaitre pushed a commit that referenced this pull request Feb 27, 2019
…13243)

* Remove unused code

* Squash all the PR 9040 commits

initial PR commit

seq_dataset.pyx generated from template

seq_dataset.pyx generated from template #2

rename variables

fused types consistency test for seq_dataset

a

sklearn/utils/tests/test_seq_dataset.py

new if statement

add doc

sklearn/utils/seq_dataset.pyx.tp

minor changes

minor changes

typo fix

check numeric accuracy only up 5th decimal

Address oliver's request for changing test name

add test for make_dataset and rename a variable in test_seq_dataset

* FIX tests

* TST more numerically stable test_sgd.test_tol_parameter

* Added benchmarks to compare SAGA 32b and 64b

* Fixing gael's comments

* fix

* solve some issues

* PEP8

* Address lesteve comments

* fix merging

* avoid using assert_equal

* use all_close

* use explicit ArrayDataset64 and CSRDataset64

* fix: remove unused import

* Use parametrized to cover ArrayDataset-CSRDataset-32-64 matrix

* for consistency use 32 first then 64 + add 64 suffix to variables

* it would be cool if this worked !!!

* more verbose version

* revert SGD changes as much as possible.

* Add solvers back to bench_saga

* make 64 explicit in the naming

* remove checking native python type + add comparison between 32 64

* Add whatsnew with everyone with commits

* simplify a bit the testing

* simplify the parametrize

* update whatsnew

* fix pep8
glemaitre pushed a commit that referenced this pull request Apr 24, 2019
* initial commit

* used random class

* fixed failing testcases, reverted __init__.py

* fixed failing testcases #2
- passed rng as parameter to ParameterSampler class
- changed seed from 0 to 42 (as original)

* fixed failing testcases #2
- passed rng as parameter to SparseRandomProjection class

* fixed failing testcases #4
- passed rng as parameter to GaussianRandomProjection class

* fixed failing test case because of flake 8
glemaitre pushed a commit that referenced this pull request Jun 20, 2019
master merge with destro latest
glemaitre pushed a commit that referenced this pull request Aug 16, 2020
…scikit-learn#10591)

* Initial add DET curve to classification metrics

* Add DET to exports

* Fix DET-curve doctest errors

- Sample snippet in  model_evaluation documentation was outdated.

* Clarify wording in DET-curve computation

- Align to the wording of ranking module to make it consistent.
- Add correct description of input and outputs.
- Update and fix non-existent links

* Beautify DET curve documentation source

- Limit line length to 80 characters.

* Expand DET curve documentation

- Add an example plot to show difference between ROC and DET curves.
- Expand Usage Note section with background information and properties
of DET curves.

* Update DET-curve documentation

- Fix typos and some grammar improvements.
- Use named references to avoid potential conflicts with other sections.
- Remove unneeded references and improved existing ones by using e.g.
using versioned links.

* Select relevant DET points using slice object

* Remove some dubiety from DET curve doc-string

* Add DET curve contributors

* Add tests for DET curves

* Streamline DET test by using parametrization

* Increase verbosity of DET curve error handling

- Explicitly sanity check input before computing a DET curve.
- Add test for perfect scores.
- Adapt indentation style to match the test module.

* Add reference for DET curves in invariance test

* Add automated invariance checks for DET curves

* Resolve merge artifacts

* Make doctest happy

* Fix whitespaces for doctest

* Revert unintended whitespace changes

* Revert unintended white space changes #2

* Fix typos and grammar

* Fix white space in doc

* Streamline test code

* Remove rebase artifacts

* Fix PR link in doc

* Fix test_ranking

* Fix rebase errors

* Fix import

* Bring back newlines

- Swallowed by copy/paste

* Remove uncited ref link

* Remove matplotlib deprecation warning

* Bring back hidden reference

* Add motivation to DET example

* Fix lint

* Add citation

* Use modern matplotlib API

Co-authored-by: Jeremy Karnowski <jeremy.karnowski@gmail.com>
Co-authored-by: Julien Cornebise <julien@cornebise.com>
Co-authored-by: Daniel Mohns <daniel.mohns@zenguard.org>
glemaitre pushed a commit that referenced this pull request Oct 21, 2023
glemaitre pushed a commit that referenced this pull request May 7, 2025
Strategy to forward prefix parameter when creating blocks

6 participants