[MRG] ENH Add example comparing the distributions of all scaling preprocessors (#2)
Conversation
|
@raghavrv can you plot the robust scaler? |
|
That looks pretty cool! Indeed it seems to support the point made by @ogrisel about the potentially decorrelating nature of this non-linear transform. |
|
I don't understand why the normalized data looks discretized on the y axis. Aren't we linearly interpolating points between quantiles? EDIT by @raghavrv (This was for an old deleted plot) |
|
@ogrisel Sorry I plotted the wrong features... |
|
Note that I didn't binarize it this time and instead used matplotlib's colormap... |
|
The plots look great! I find it weird that the robust scaler yields such a flat profile. How many "outliers" are there on the y axis? I would have assumed that IQR scaling should have performed better. |
|
Please add x and y labels on the first 2 plots (data without transform and after zoom on non-outliers). |
It scales the data by the IQR, so the outliers will still be outliers in the scaled data, hence the flat profile... We could zoom in like it is done here, but I think it would be unfair to the other scalers... |
|
Maybe best to just show 0-1 range after scaling.
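To see why IQR-based scaling leaves such a flat profile, here is a minimal sketch (my own toy data, not the example's code) contrasting RobustScaler and MinMaxScaler on a feature with a few large outliers:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, MinMaxScaler

rng = np.random.RandomState(0)
# 100 inliers around 0 plus three very large outliers
X = np.concatenate([rng.normal(0, 1, 100), [50.0, 80.0, 120.0]]).reshape(-1, 1)

# RobustScaler divides by the IQR of the bulk, so the outliers
# remain far from the inliers after scaling
X_robust = RobustScaler().fit_transform(X)

# MinMaxScaler squeezes everything into [0, 1], so the inliers
# are compressed into a narrow band near 0 (the "flat profile")
X_minmax = MinMaxScaler().fit_transform(X)

print(X_robust.max())  # still much larger than the scaled inliers
print(X_minmax.max())  # 1.0
```

This is why zooming the axes (rather than changing the scaler) is needed to see the inlier structure after robust scaling.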
|
|
or show 0-1 and mark outliers along edges with x's
|
I am not sure how showing the outliers would work, but I like the idea of zooming on the [0.01-0.99] quantiles for all the transforms (in addition to the original, unzoomed plot for each transform). |
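One way to implement that zoom is to clip the axis limits to the data's percentiles. `zoom_limits` below is a hypothetical helper of my own, not part of the PR:

```python
import numpy as np

def zoom_limits(x, lower=1, upper=99):
    """Axis limits that clip the view to the given percentile range."""
    return np.percentile(x, lower), np.percentile(x, upper)

# e.g. ax.set_xlim(*zoom_limits(X[:, 0])) on each zoomed panel,
# applied uniformly so no scaler is treated unfairly
```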
|
Very nice! |
|
Maybe you could remove the color bar from the first plots and only display it for the last plot as they all use the same color scale. |
|
Great work! It looks very helpful and clear. |
|
Nice work @raghavrv ! |
axes = subplots[offset + 2: offset + 4] + subplots[offset + 6: offset + 8]
plot_distribution(axes, X[non_outliers], y[non_outliers], hist_nbins=50,
                  plot_title=(title +
                              "\n(Zoomed-in at quantile range [0, 99))"),
quantile -> percentile.
Also, I think parentheses enclosing these subtitles can be removed.
|
Great stuff! |
@@ -0,0 +1,133 @@
#!/usr/bin/python
`#!/usr/bin/env python` would be better
outliers that can make visualization of the data difficult.

Also linear models like :class:`sklearn.linear_model.SVM` require data which is
normalized to the range [-1, 1].
approximately normalized to the [-1, 1] or [0, 1] range, or at the very least have all the features on the same scale.
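As a rough illustration of those target ranges (a sketch with my own toy data, using scikit-learn's standard scalers):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

X = np.array([[-10.0, 2.0],
              [5.0, 4.0],
              [0.0, 6.0]])

# MaxAbsScaler divides each feature by its max absolute value -> [-1, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)

# MinMaxScaler maps each feature's observed range onto [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
```

Either way, all features end up on the same scale, which is the weaker condition the review comment suggests.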
X_full, y_full = dataset.data, dataset.target

# Take only 2 features to make visualization easier
# Feature 0 has a tapering distribution of outliers
Feature 0 has a long-tail distribution. Feature 5 has a few but very large outliers.
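The feature selection under discussion is plain column indexing; below is a sketch with a synthetic stand-in for the dataset (the array, its shape, and the lognormal distribution are my assumptions, not the PR's code):

```python
import numpy as np

# Stand-in for dataset.data (n_samples, n_features); in the example this
# would be the housing data with long-tailed / outlier-heavy features
X_full = np.random.RandomState(0).lognormal(size=(1000, 8))

# Keep only two features with contrasting behaviour:
# feature 0 (long-tailed) and feature 5 (a few very large outliers)
X = X_full[:, [0, 5]]
```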
# Blank space to avoid overlapping of plots;
# plt.tight_layout does not work with gridspec, height of this row is
# adjusted at `height_ratios` param given to `GridSpec`
plt.subplot(gs[offset+12: offset+16]).axis('off')
cosmetics: gs[offset + 12:offset + 16] or gs[offset+12:offset+16] but no spacing around the slicing operator : and especially not asymmetric, otherwise it looks like the literal notation of a dict.
|
Thanks all for the comments :) Any idea on how the doc can be improved? Should it have a note for each scaler/normalizer stating its use case? (I have updated the above comment with the new plot) |
Yes I think that would be useful to give a high level analysis of the behavior on each w.r.t. this specific 2D dataset. |
|
I'm now wondering if there's any value in showing the transformations on held-out data |
They should look very similar to the data on the training set, right? Adding plots for held out data would render this example quite complex. I am not sure it's worth the additional cognitive load. |
|
Happy to leave it. Yes it should look similar except for minmax.
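The min-max caveat can be seen directly: a MinMaxScaler fitted on training data maps held-out values outside the training range to values outside [0, 1]. A minimal sketch with my own toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[-2.0], [12.0]])  # outside the training range

scaler = MinMaxScaler().fit(X_train)
# Scaling parameters come from the training range [0, 10],
# so held-out extremes land outside [0, 1]
print(scaler.transform(X_test).ravel())  # [-0.2  1.2]
```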
|
|
Merged!!! |
|
Thanks for merging! 🎉 I'll edit the docs in the main branch... |
…ocessor (#2)
* ENH Add example comparing the distribution of all scaling preprocessor
* Remove Jupyter notebook convert
* FIX/ENH Select feat before not after; Plot interquantile data range for all
* Add heatmap legend
* Remove comment maybe?
* Move doc from robust_scaling to plot_all_scaling; Need to update doc
* Update the doc
* Better aesthetics; Better spacing and plot colormap only at end
* Shameless author re-ordering ;P
* Use env python for she-bang
Strategy to forward prefix parameter when creating blocks

Main PR: scikit-learn#8363
I have some initial plots comparing different scalers on the California housing dataset...
I binarized the targets to help visualize them better...
For the quantile normalizer:
@glemaitre @tguillemot @ogrisel @dengemann @jnothman