Count vectorizer (with Actors) by TomAugspurger · Pull Request #705 · dask/dask-ml

TomAugspurger · 2020-07-22T15:57:27Z

This has an implementation of CountVectorizer.

The primary difficulty is learning the vocabulary from the data. The basic idea is to just use scikit-learn's CountVectorizer on each partition of a bag. That'll give us a list of vocabularies (dictionaries) that we need to merge.
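As a rough illustration of the merge step described above (not the code in this PR), the sketch below builds one vocabulary per partition and unions them into a single sorted term-to-column mapping. The helpers `partition_vocabulary` and `merge_vocabularies` are hypothetical names, and the regex tokenizer is a plain-Python stand-in for fitting scikit-learn's CountVectorizer on each partition; it copies that class's default token pattern and lowercasing.

```python
import re

# scikit-learn's default token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def partition_vocabulary(documents):
    """Stand-in for fitting CountVectorizer on one partition.

    Returns a dict mapping each term to a partition-local index.
    """
    terms = set()
    for doc in documents:
        terms.update(TOKEN_PATTERN.findall(doc.lower()))
    return {term: index for index, term in enumerate(sorted(terms))}


def merge_vocabularies(vocabularies):
    """Union per-partition vocabularies and reassign global column indices."""
    terms = set()
    for vocab in vocabularies:
        terms.update(vocab)  # only the keys matter; local indices are discarded
    return {term: index for index, term in enumerate(sorted(terms))}


# Two "partitions" of a bag of documents.
partitions = [
    ["the quick brown fox", "the lazy dog"],
    ["a quick brown dog", "foxes and dogs"],
]
vocabulary = merge_vocabularies(partition_vocabulary(p) for p in partitions)
```

The partition-local indices are thrown away during the merge because a term's column can only be fixed once every partition has been seen.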

A naive implementation would just pass the vocabularies around as dictionaries in the task graph. This isn't great, since they can get pretty large. For the news dataset, scikit-learn learns a vocabulary that's 18 MB.
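To get a feel for why embedding vocabularies in the task graph gets expensive, one can measure how large a vocabulary dictionary is once pickled. This is only a sketch: the terms are synthetic, `serialized_size` is a hypothetical helper, and the numbers will not match the 18 MB figure quoted for the news dataset.

```python
import pickle


def serialized_size(vocabulary):
    """Number of bytes needed to pickle a term -> index dictionary."""
    return len(pickle.dumps(vocabulary))


# Synthetic vocabularies of increasing size.
small = {f"term{i}": i for i in range(1_000)}
large = {f"term{i}": i for i in range(100_000)}

print(serialized_size(small), serialized_size(large))
```

Every task that receives the dictionary as an argument pays roughly this cost in serialization and transfer, which is why avoiding repeated movement of the vocabulary matters.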

I've prototyped something using distributed's Actors (cc @mrocklin if you're interested in seeing them in use). Timings:

| Actors? | fit | compute |
|---------|-----|---------|
| Yes     | 2.3 | 3.4     |
| No      | 5.2 | 4.2     |

I'll probably spend a bit more time to make this work without actors again. Regardless, I'll be writing up an example / blog post.

Closes #689

mrocklin · 2020-07-22T16:00:03Z

tests/feature_extraction/test_text.py

+def test_count_vectorizer():
+    # TODO: gen_cluster, pickle futures, issue.
+    from distributed import Client
+
+    with Client():


Use either gen_cluster (recommended) or the c, s, a, b fixtures in utils_test, which support the sync API.

dask_ml/feature_extraction/text.py

TomAugspurger · 2020-07-24T15:38:33Z

Ok, I've removed the use of actors. They might provide slightly faster performance in some cases, but once I was using regular old dask correctly (not moving data around unnecessarily), the difference is negligible for my test case.

I'll probably write this up in a blog post, as a reminder that thinking about where your data is relative to your computation can be important for performance.

mrocklin · 2020-07-24T16:19:18Z

Cool. I'm glad to hear that you were able to find a high-performing solution that was also simple.



TomAugspurger added 3 commits July 22, 2020 09:56

wip

3f83164

wip-actor

f477b21

basic test

5ada1dd

mrocklin reviewed Jul 22, 2020

dask_ml/feature_extraction/text.py (outdated, resolved)

TomAugspurger added 4 commits July 22, 2020 13:45

try

03112d7

toggle

2f41c2a

toggle

62ed07a

scatter

479d6e8

mrocklin reviewed Jul 22, 2020

dask_ml/feature_extraction/text.py (outdated, resolved)

mrocklin reviewed Jul 22, 2020

dask_ml/feature_extraction/text.py (outdated, resolved)

mrocklin reviewed Jul 22, 2020

dask_ml/feature_extraction/text.py (outdated, resolved)

mrocklin reviewed Jul 22, 2020

dask_ml/feature_extraction/text.py (outdated, resolved)

TomAugspurger added 4 commits July 23, 2020 09:08

fixups

1225f57

wip

29d0377

no actor

392b3ad

remove actor

f2ecd8b

TomAugspurger added 2 commits July 24, 2020 10:44

docs

ee309c4

fixup

c5a93a4

TomAugspurger added 2 commits July 24, 2020 14:32

Merge remote-tracking branch 'upstream/master' into count-vectorizer-actor

00e9257

fixup

eaca917

TomAugspurger merged commit 236e13a into dask:master Jul 24, 2020

TomAugspurger deleted the count-vectorizer-actor branch July 24, 2020 20:21

TomAugspurger merged 15 commits into dask:master from TomAugspurger:count-vectorizer-actor