Skip to content

Support for token processor. Fixes #1156#1537

Closed
wrichert wants to merge 29 commits intoscikit-learn:masterfrom
wrichert:token-processor
Closed

Support for token processor. Fixes #1156#1537
wrichert wants to merge 29 commits intoscikit-learn:masterfrom
wrichert:token-processor

Conversation

@wrichert
Copy link
Copy Markdown

@wrichert wrichert commented Jan 8, 2013

No description provided.

@wrichert
Copy link
Copy Markdown
Author

wrichert commented Jan 8, 2013

In the current implementation, CountVectorizer keeps at least one copy per preprocessing step (preprocessing, tokenizing, etc.).

I think we could rewrite it using generators only, which would be more memory and performance friendly.

@amueller
Copy link
Copy Markdown
Member

amueller commented Jan 8, 2013

Hi @wrichert. Thanks for the PR. In general I think it looks good.
The test is in the wrong place, though.
The tests for the feature_extaction module are in features_extraction/tests.

@wrichert
Copy link
Copy Markdown
Author

wrichert commented Jan 8, 2013

Right. Will move it when I have again access to the repo.

@wrichert
Copy link
Copy Markdown
Author

wrichert commented Jan 9, 2013

Done.

@wrichert
Copy link
Copy Markdown
Author

@amueller Is there anything needed from my side to get this PR into the next release?

@amueller
Copy link
Copy Markdown
Member

@wrichert Sorry I have not that much time to review right now. Will try tonight. Maybe @ogrisel wants to have a look.

@amueller
Copy link
Copy Markdown
Member

Maybe a short example in form of a doctest in the narrative or docstring would be nice?

@wrichert
Copy link
Copy Markdown
Author

@amueller Any chance that this is included in 0.13?

@amueller
Copy link
Copy Markdown
Member

If you find two devs to review and merge it. I won't have time, sorry.

@amueller
Copy link
Copy Markdown
Member

I'll try to have a look soon. Sorry, I'm still pretty busy atm.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetics: please add a blank line above this line (PEP 257).

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Feb 7, 2013

Could you please rebase this branch on top of master (or merge master into it if you are not familiar with rebasing).

Then fix pep8 issues reported by: http://pypi.python.org/pypi/pep8

@wrichert
Copy link
Copy Markdown
Author

wrichert commented Feb 8, 2013

Sure. Will address your suggestions beginning of next week as I won't have time before that.

@wrichert
Copy link
Copy Markdown
Author

Hmm, I rebased the "Support for token processor. Fixes #1537" and resolved all the conflicts, but I'm not sure whether I did the right thing, seeing lots of other commits appearing in this threads.

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Feb 28, 2013

Indeed please, start a new branch off the current master and cherry pick just the commits or files that are related to issue #1156 to make the review easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants