[MRG] tfidfvectorizer documentation by blooraspberry · Pull Request #12204 · scikit-learn/scikit-learn

blooraspberry · 2018-09-29T16:43:04Z

Reference Issues/PRs

Closes #6766 and closes #9369.

What does this implement/fix? Explain your changes.

This adds more information in the TfidfVectorizer documentation. It now includes comments about CountVectorizer and TfidfTransformer.
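To illustrate the relationship the added docstring text describes, here is a minimal sketch. It assumes a hypothetical three-document corpus and default parameters throughout, and checks that TfidfVectorizer produces the same matrix as CountVectorizer followed by TfidfTransformer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: raw token counts. Step 2: tf-idf weighting of those counts.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# The combined estimator, also with default parameters.
one_step = TfidfVectorizer().fit_transform(corpus)

# Compare the two sparse results as dense arrays.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The comparison only holds when both routes use matching parameters; changing, say, the normalization on one side alone would make the matrices differ.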

Any other comments?

amueller · 2018-09-29T16:47:58Z

sklearn/feature_extraction/text.py

+
+    CountVectorizer converts a collection of text documents to a matrix of token counts. 
+
+    TfidfTransformer then converts the count matrix from CountVectorizer to a normalized tf-idf representation. Tf is term frequency, and idf is inverse document frequency. This is a common way to calculate the count of a word relative to the appearance of a document. 


Can you make sure to break the lines please?

amueller · 2018-09-29T16:48:23Z

sklearn/feature_extraction/text.py

+    TfidfTransformer then converts the count matrix from CountVectorizer to a normalized tf-idf representation. Tf is term frequency, and idf is inverse document frequency. This is a common way to calculate the count of a word relative to the appearance of a document. 
+
+    The formula that is used to compute the tf-idf of term t is
+    tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as


I think the other documentation has some formatting for these, can you make sure to copy the code, not the rendering?
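A worked sketch of the formula under discussion may help here. It assumes the smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which I understand to be the library default (smooth_idf=True), followed by L2 normalization of each document. The whitespace tokenizer and the function names are hypothetical simplifications, not the library's code.

```python
import math
from collections import Counter


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: c * smooth_idf(n_docs, df[t]) for t, c in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["apple banana apple", "banana cherry"]
rows = tfidf_rows(docs)
print(rows[0])
```

With this smoothing, a term that appears in every document gets an idf of exactly 1, so it is down-weighted relative to rarer terms but never zeroed out.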

amueller · 2018-09-29T17:49:06Z

Can you please reference the issues and PRs this is addressing in the description? Then merging this will close these.

amueller

looks good.

qinhanmin2014

@blooraspberry I've restarted Travis for you, and there are flake8 errors. Please correct them according to https://travis-ci.org/scikit-learn/scikit-learn/jobs/435054659

qinhanmin2014 · 2018-09-30T07:32:54Z

And I guess this closes #6766 and #9369

NicolasHug · 2018-10-02T13:47:48Z

Travis output is unreadable, so here is what needs to be fixed in text.py:

~/dev/sklearn(branch:pr/12212*) » flake8 sklearn/feature_extraction/text.py
sklearn/feature_extraction/text.py:240:9: E731 do not assign a lambda expression, use a def
sklearn/feature_extraction/text.py:1289:64: W291 trailing whitespace
sklearn/feature_extraction/text.py:1291:75: W291 trailing whitespace
sklearn/feature_extraction/text.py:1292:18: W291 trailing whitespace
sklearn/feature_extraction/text.py:1294:78: W291 trailing whitespace
sklearn/feature_extraction/text.py:1295:79: W291 trailing whitespace
sklearn/feature_extraction/text.py:1296:78: W291 trailing whitespace
sklearn/feature_extraction/text.py:1297:46: W291 trailing whitespace
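The W291 warnings above can be cleared by removing trailing spaces and tabs from the flagged lines. A minimal sketch, using a hypothetical helper name (most editors can also do this on save):

```python
def strip_trailing_whitespace(text):
    """Return text with trailing spaces and tabs removed from every line."""
    lines = text.split("\n")
    return "\n".join(line.rstrip(" \t") for line in lines)


# Hypothetical docstring fragment with trailing whitespace on both lines.
before = "    CountVectorizer converts text to counts. \n    More text.\t\n"
after = strip_trailing_whitespace(before)
print(repr(after))
```

The E731 report is a different kind of fix: it asks for the lambda assignment to be rewritten as a def, which this helper does not address.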

sergulaydore · 2018-11-11T16:29:01Z

Hello @blooraspberry,

Thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation.

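Any questions:

- see workflow (https://github.com/WiMLDS/nyc-2018-scikit-sprint/blob/master/2_contributing_workflow.md) for reference
- ask on this PR conversation or the issue tracker
- ask on wimlds gitter (https://gitter.im/scikit-learn/wimlds) with a reference to this PR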

cc: @reshamas

blooraspberry · 2018-11-29T23:49:48Z

Hi Sergul,

Sorry, I just saw this email; I didn't realize my GitHub account is connected to another email address. I'll take a look soon.

Sharon


reshamas · 2018-12-16T15:54:17Z

@blooraspberry
Will you be completing this PR?

reshamas · 2018-12-18T04:14:37Z

I am working on this PR.

tfidfvectorizer documentation

c15d5c8

amueller reviewed Sep 29, 2018

adding line breaks

93ae2d1

amueller approved these changes Sep 29, 2018

NicolasHug mentioned this pull request Sep 29, 2018

[MRG] added class_ to the LogisticRegression documentation #12212

Closed

jnothman approved these changes Sep 30, 2018

qinhanmin2014 approved these changes Sep 30, 2018

This was referenced Dec 18, 2018

improve TfidfVectorizer documentation #12811

Closed

[MRG] improve tfidfvectorizer documentation #12822

Merged

jnothman closed this in #12822 Jan 16, 2019

blooraspberry wants to merge 2 commits into scikit-learn:master from blooraspberry:tfid_stuff_ST

7 participants

