Count vectorizer (with Actors) #705
Merged
TomAugspurger merged 15 commits into dask:master on Jul 24, 2020
Conversation
mrocklin reviewed Jul 22, 2020
Comment on lines +112 to +116
def test_count_vectorizer():
    # TODO: gen_cluster, pickle futures, issue.
    from distributed import Client

    with Client():
Member
Either use gen_cluster (recommended) or the c, s, a, b fixtures in utils_test, which support the sync API.
Member
Author
Ok, I've removed the use of actors. They might provide slightly faster performance in some cases, but once I was using regular old dask correctly (not moving data around unnecessarily), the difference is negligible for my test case. I'll probably write this up in a blog post; it's a reminder of how thinking about where your data is relative to your computation can be important for performance.
Member
Cool. I'm glad to hear that you were able to find a high-performing solution that was also simple.
This has an implementation of CountVectorizer.
The primary difficulty is learning the vocabulary from the data. The basic idea is to just use scikit-learn's CountVectorizer on each partition of a bag. That'll give us a list of vocabularies (dictionaries) that we need to merge.
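The per-partition fit and the merge step can be sketched roughly like this. This is a minimal illustration, not the PR's actual code: it assumes vocabularies are plain token-to-index dicts (as scikit-learn produces), and the helper names `fit_partition` and `merge_vocabularies` are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

def fit_partition(docs):
    # Fit scikit-learn's CountVectorizer on a single partition and
    # return its learned vocabulary (token -> column index).
    return CountVectorizer().fit(docs).vocabulary_

def merge_vocabularies(vocabularies):
    # Union the tokens from every partition's vocabulary, then
    # reassign stable column indices (sorted, as scikit-learn does).
    tokens = set()
    for vocab in vocabularies:
        tokens.update(vocab)
    return {token: i for i, token in enumerate(sorted(tokens))}

# Stand-in for bag partitions; in practice this would be bag.map_partitions.
partitions = [["the cat sat"], ["the dog ran", "a dog barked"]]
vocabulary = merge_vocabularies([fit_partition(p) for p in partitions])
```

With the merged vocabulary fixed, the transform step becomes embarrassingly parallel: each partition can be transformed independently with `CountVectorizer(vocabulary=...)`.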
A naive implementation would just pass the vocabularies around as dictionaries in the task graph. This isn't great, since these can get pretty large: for the news dataset, scikit-learn learns a vocabulary that's 18 MB. I've prototyped something using distributed's Actors (cc @mrocklin if you're interested in seeing them in use). Timings:
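One alternative to embedding the large dictionary in every task (and to actors) is to scatter the merged vocabulary once with distributed's `Client.scatter`, so tasks carry a small future reference rather than a copy of the dict. A rough sketch under that assumption, not the PR's actual implementation (the tiny vocabulary here is just a stand-in):

```python
from distributed import Client
from sklearn.feature_extraction.text import CountVectorizer

def transform_partition(docs, vocabulary):
    # With a fixed vocabulary, transform is embarrassingly parallel.
    return CountVectorizer(vocabulary=vocabulary).transform(docs)

client = Client(processes=False, n_workers=1, threads_per_worker=1)
vocabulary = {"cat": 0, "dog": 1, "the": 2}  # stand-in for the merged vocabulary

# Scatter once; submitted tasks reference the future instead of
# serializing the whole dict into the task graph repeatedly.
[vocab_future] = client.scatter([vocabulary], broadcast=True)
futures = [client.submit(transform_partition, docs, vocab_future)
           for docs in (["the cat"], ["the dog"])]
results = client.gather(futures)
client.close()
```

Each result is a scipy sparse matrix with one column per vocabulary entry.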
I'll probably spend a bit more time to make this work without actors again. Regardless, I'll be writing up an example / blog post.
Closes #689