[MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings#9369
[MRG+1] Edited TfidfVectorizer and TfidfTransformer docstrings#9369musiciancodes wants to merge 4 commits intoscikit-learn:masterfrom musiciancodes:tfidf_docs
Conversation
|
This seems like an improvement to me, but @amueller will have to speak as to if it addresses all the concerns he brought up. |
| that contain term t. The effect of adding "1" to the idf in the equation | ||
| above is that terms with zero idf, i.e., terms that occur in all documents | ||
|
|
||
| tf-idf(d, t, D) = tf(t, d) * idf(t, D) |
There was a problem hiding this comment.
Can you maybe reproduce the formula also in the TfidfVectorizer docs? Otherwise looks good :) I feel like the two should largely have the same docs.
sklearn/feature_extraction/text.py
Outdated
| Normalization is "c" (cosine) when ``norm='l2'``, "n" (none) | ||
| when ``norm=None``. | ||
|
|
||
| By default, the l2 norm is used. |
There was a problem hiding this comment.
This should be at the explanation of the norm parameter, the first line should end with something like (default='l2')
There was a problem hiding this comment.
I couldn't decide whether or not to go down a rabbit hole of l2 vs. l1 norm, partly because my first pass at a google search didn't reveal the most cogent explanations; feels out of scope for a docstring, though?
There was a problem hiding this comment.
yeah I mostly wanted the clarification at the explanation of the parameter.
sklearn/feature_extraction/text.py
Outdated
| Tf is "n" (natural) by default, "l" (logarithmic) when | ||
| ``sublinear_tf=True``. | ||
| Idf is "t" when use_idf is given, "n" (none) otherwise. | ||
| The "norm" parameter provides the "how" of how each input vector is |
There was a problem hiding this comment.
I feel this is redundant here as it is explained below, right?
There was a problem hiding this comment.
It is - I was trying to address it as it seemed like it was one of the points of discussion in the previous issue. Since norm discussed as a parameter, though, it makes more sense to discuss the default there.
|
LGTM, maybe you should do |
jnothman
left a comment
There was a problem hiding this comment.
I feel like is too verose and detailed for a docstring. We have the user guide links to provide such detail, and that particularly helps us avoid maintaining duplicate documentation!
Rather this should be limited to saying that tfidf reweights vectors of term counts to emphasise terms that are frequent within a document and infrequent across the collection if documents. The rest in the user guide
|
|
||
| Inverse-document-frequency (idf) = 1 / (# of docs a term appears in) | ||
|
|
||
| This is a common term weighting scheme in information |
There was a problem hiding this comment.
Maybe "family of schemes". Really you mean that the product of a monotonic increasing function of TF and a monotonic increasing function of IDF constitutes a family of term weighting schemes. You miss the fact that we care about their product.
There was a problem hiding this comment.
Perhaps adding tfidf as the next definition makes sense. It may also be appropriate to format these definitions as a definition list.
Issue - docstring for TfidfVectorizer doesn't track TfidfTransformer
#6766
What does this implement/fix? Explain your changes.
I made the TfIdfVectorizer docstring a little bit more verbose about tf-idf, referred to TfidfTransformer, and edited both docstrings for clarity.