[MRG] apply sparse_threshold even if all columns are sparse#12304
[MRG] apply sparse_threshold even if all columns are sparse#12304jnothman merged 5 commits intoscikit-learn:masterfrom
Conversation
jnothman
left a comment
There was a problem hiding this comment.
Please update the docstring and what's new
|
done |
| When the transformed output consists of all sparse or all dense data, | ||
| the stacked result will be sparse or dense, respectively, and this | ||
| keyword will be ignored. | ||
| If the transformed output contains sparse matrices, it will be stacked |
There was a problem hiding this comment.
Would "If the output of the different transformers contains sparse matrices, ..." be clearer? Because now if I start reading this, I would interpret "transformed output" as the output of the ColumnTransformer (so already stacked), and then the rest of the sentence reads a bit strange.
There was a problem hiding this comment.
yeah I wasn't happy with that wording either. I like yours.
| as a sparse matrix if the density is lower than this value. Use | ||
| ``sparse_threshold=0`` to always return dense. When the transformed | ||
| output consists of all dense data, the stacked result will be dense, | ||
| and this keyword will be ignored. |
There was a problem hiding this comment.
Should we also lift this special case? (the case of all dense data)
In principle for the default of 0.3, this will always be the case anyway (you would need to set the threshold to something bigger than 1 to always have sparse)
There was a problem hiding this comment.
So you're saying we should consider a way to force dense output into sparse? I think we can leave that feature till later, as currently threshold > 1 is not valid.
There was a problem hiding this comment.
I feel like this case would be much rarer? When would you want that?
There was a problem hiding this comment.
I don't think I would ever want that, I was just wondering for simplicity's sake (to not have special cases). But maybe it is so logical to keep all dense as dense, that actually strictly following the sparse_threshold would actually be less "normal".
So you can ignore my comment :)
There was a problem hiding this comment.
LGTM. It would not have hurt to add some validation about sparse_threshold being in [0, 1] and make it explicit in the docstring as currently it's not very clear (but that could be done in a different PR).
Waiting a few days before merging in case Joel has other comments on this.
jnothman
left a comment
There was a problem hiding this comment.
Though yes, validation would be nice
…se (scikit-learn#12304)" This reverts commit 1dc7cc0.
…se (scikit-learn#12304)" This reverts commit 1dc7cc0.
Fixes #12150