Count vectorizer (with Actors) #705

Merged
TomAugspurger merged 15 commits into dask:master from TomAugspurger:count-vectorizer-actor
Jul 24, 2020


@TomAugspurger (Member) commented Jul 22, 2020

This adds an implementation of CountVectorizer.

The primary difficulty is learning the vocabulary from the data. The basic idea is to just use scikit-learn's CountVectorizer on each partition of a bag. That'll give us a list of vocabularies (dictionaries) that we need to merge.
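The merge step can be sketched in plain Python. This is a minimal illustration, not necessarily what this PR does: `merge_vocabularies` is a hypothetical helper, and the hand-written dictionaries stand in for the `vocabulary_` attribute that scikit-learn's CountVectorizer would learn on each partition.

```python
from itertools import chain


def merge_vocabularies(vocabularies):
    """Merge per-partition term -> index mappings into one global vocabulary.

    Only the terms (keys) of each input are used. Indices are reassigned in
    sorted term order so the result is deterministic.
    """
    terms = sorted(set(chain.from_iterable(vocabularies)))
    return {term: index for index, term in enumerate(terms)}


# Hand-written stand-ins for the vocabularies learned on two partitions.
partition_vocabularies = [
    {"dask": 0, "scales": 1},
    {"bags": 0, "dask": 1, "partition": 2},
]
vocabulary = merge_vocabularies(partition_vocabularies)
print(vocabulary)  # {'bags': 0, 'dask': 1, 'partition': 2, 'scales': 3}
```

The per-partition indices are discarded on purpose: the same term can have a different index in each partition, so only the merged mapping is meaningful.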

A naive implementation would just pass the vocabularies around as dictionaries in the task graph. This isn't great since they can get pretty large: for the news dataset, scikit-learn learns a vocabulary that's 18 MB.

I've prototyped something using distributed's Actors (cc @mrocklin if you're interested in seeing them in use). Timings:

| Actors? | fit | compute |
|---------|-----|---------|
| Yes     | 2.3 | 3.4     |
| No      | 5.2 | 4.2     |

I'll probably spend a bit more time to make this work without actors again. Regardless, I'll be writing up an example / blog post.
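For readers who haven't used Actors, here is a rough sketch of the pattern, assuming a running `distributed.Client`. The class and function names (`VocabularyActor`, `learn_vocabulary_with_actor`) are hypothetical and are not taken from this PR; only `client.submit(..., actor=True)` and the `.result()` calls are distributed's API.

```python
class VocabularyActor:
    """Holds the set of terms seen so far. As an Actor, it lives on one worker."""

    def __init__(self):
        self.terms = set()

    def update(self, terms):
        # Merge one partition's terms into the shared state; return the new size.
        self.terms.update(terms)
        return len(self.terms)

    def finalize(self):
        # Assign indices in sorted term order.
        return {term: index for index, term in enumerate(sorted(self.terms))}


def learn_vocabulary_with_actor(client, partition_terms):
    """Send each partition's terms to a single Actor and read back the result.

    `client` is assumed to be a `distributed.Client`; `partition_terms` is a
    list of iterables of terms, one per partition.
    """
    # actor=True keeps the instance (and its state) on a single worker.
    actor = client.submit(VocabularyActor, actor=True).result()
    for terms in partition_terms:
        # Actor method calls return a future-like object; .result() blocks.
        actor.update(list(terms)).result()
    return actor.finalize().result()
```

For brevity the calls here are made from the client. The point of the pattern is that the growing vocabulary stays in one place instead of being passed through the task graph.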

Closes #689

Comment on lines +112 to +116
def test_count_vectorizer():
    # TODO: gen_cluster, pickle futures, issue.
    from distributed import Client

    with Client():

Either gen_cluster (recommended) or the c, s, a, b fixtures in utils_test, which support the sync API.
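As a sketch of the recommended option, a `gen_cluster`-based test might look like the following. `count_terms` is a hypothetical stand-in for the per-partition work, not the test in this PR; `gen_cluster(client=True)` passes an async client, the scheduler, and two workers to the test coroutine.

```python
from distributed.utils_test import gen_cluster


def count_terms(documents):
    # Toy stand-in for per-partition work: whitespace-split term counts.
    counts = {}
    for document in documents:
        for term in document.split():
            counts[term] = counts.get(term, 0) + 1
    return counts


@gen_cluster(client=True)
async def test_count_terms_on_cluster(c, s, a, b):
    # c: async Client, s: Scheduler, a and b: Workers, all in-process.
    future = c.submit(count_terms, ["dask bag", "dask array"])
    result = await future
    assert result == {"dask": 2, "bag": 1, "array": 1}
```

Compared with opening `Client()` inside the test, this avoids starting a separate cluster per test and cleans up the scheduler and workers when the coroutine returns.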

@TomAugspurger
Member Author

Ok, I've removed the use of actors. They might provide slightly faster performance in some cases, but once I was using regular old dask correctly (not moving data around unnecessarily), the difference was negligible for my test case.

I'll probably write this up in a blog post as a reminder of how important it is for performance to think about where your data is relative to your computation.
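One way to avoid moving a large vocabulary around repeatedly is to send it to the workers once and hand each task a reference to it. The sketch below assumes a `distributed.Client`; the helper names are hypothetical, and this is not necessarily how the merged code does it.

```python
def count_with_vocabulary(documents, vocabulary):
    """Count occurrences of known terms in one partition of documents."""
    counts = [0] * len(vocabulary)
    for document in documents:
        for term in document.split():
            index = vocabulary.get(term)
            if index is not None:
                counts[index] += 1
    return counts


def transform_partitions(client, partitions, vocabulary):
    """Ship `vocabulary` to the workers once, then reuse it in every task."""
    # Wrap the dict in a list: scatter treats a bare dict as a collection of
    # separate values. broadcast=True copies the data to all workers.
    (vocabulary_future,) = client.scatter([vocabulary], broadcast=True)
    futures = [
        # Each task receives the small future, resolved on the worker.
        client.submit(count_with_vocabulary, partition, vocabulary_future)
        for partition in partitions
    ]
    return client.gather(futures)
```

With this shape, the task graph carries only a small handle to the vocabulary, while the vocabulary itself already sits next to the computation.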

@mrocklin
Member

mrocklin commented Jul 24, 2020 via email

@TomAugspurger merged commit 236e13a into dask:master on Jul 24, 2020
@TomAugspurger deleted the count-vectorizer-actor branch on July 24, 2020 20:21

Development

Successfully merging this pull request may close these issues.

CountVectorizer for text preprocessing