Labels: area: api 🤖, area: cli 💻, issue: planned ⌚, topic: performance 🏎️, type: enhancement ✨
Description
NOTE: With the deprecation of the `python-Levenshtein` speedup for the `suggest` functionality (see #211 & #218), identifying other methods to increase performance is a priority. This multiprocessing-based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.

Very early POC on the suggest-multiproc branch.

It suggests some speedup is possible, but it's not a slam dunk whether it's worth it, given the sub-2 s processing times for most inventories. Properly exploiting `difflib.SequenceMatcher`'s caching behavior may change this, however.

For comparison, `python-Levenshtein` is a better speedup for less internal complexity, and doesn't appear to benefit at all from multiprocessing.
Latitude laptop (4 cores), yt (17420 objects)
Quick, rough timing:
```
>>> import sphobjinv as soi
>>> import timeit
>>> timeit.timeit("soi.Inventory('tests/resource/objects_yt.inv').suggest('ndarray')", globals=globals(), number=1)

NO mp, WITH lev:       2.177 s
WITH mp(4), WITH lev:  2.396 s
WITH mp(6), WITH lev:  2.355 s

NO mp, NO lev:        11.795 s
WITH mp(2), NO lev:   10.361 s
WITH mp(3), NO lev:    8.471 s
WITH mp(4), NO lev:    7.583 s  (~35% speedup)
WITH mp(5), NO lev:    7.399 s  (oversubscribed)
WITH mp(6), NO lev:    8.372 s  (oversubscribed)
```
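The multiprocessed variants timed above follow a straightforward fan-out shape. The sketch below is *not* the POC from the suggest-multiproc branch, just an illustration of that shape under stated assumptions: a hypothetical `suggest_mp()` helper, a stdlib `difflib` stand-in for the real fuzzywuzzy scorer, and the `nproc == 1` short-circuit discussed in the notes below.

```python
import multiprocessing as mp
from difflib import SequenceMatcher
from functools import partial


def _score(search, candidate):
    # Stand-in scorer for illustration only; the real implementation
    # would use the fuzzywuzzy WRatio-based partial matcher.
    return SequenceMatcher(None, search, candidate).ratio()


def suggest_mp(names, search, nproc=None, thresh=0.5):
    """Score `names` against `search`, optionally across worker processes.

    Hypothetical helper: `names` is a list of object-name strings,
    `nproc` defaults to the machine's core count.
    """
    nproc = nproc or mp.cpu_count()
    if nproc == 1:
        # Skip multiprocessing entirely -- no pool startup overhead.
        scores = [_score(search, name) for name in names]
    else:
        with mp.Pool(nproc) as pool:
            # partial() binds the constant search term; the pool maps
            # the candidate names across the workers.
            scores = pool.map(partial(_score, search), names)
    ranked = sorted(zip(scores, names), reverse=True)
    return [name for score, name in ranked if score >= thresh]
```

Note that the pool startup and pickling overhead is paid on every call, which is consistent with the modest (and sometimes negative) gains in the timings above for small inventories.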
Notes
- Probably can switch to `pool.map()`, without the context manager
- Looks like using `multiprocessing.cpu_count()` would be a reasonable default pool size
- Want user to be able to set a max on the process count (probably refuse to oversubscribe??)
  - API -- arg to `Inventory.suggest()`, likely
  - CLI -- new parameter for the `suggest` subparser
- If `nproc == 1`, then skip `multiprocessing` entirely
- Probably need to check at `sphobjinv` import time whether `multiprocessing` is available
  - Doing that in a way that doesn't further slow down the import may be tricky
- Per the `difflib` docs, implementing a bespoke scoring function directly with `difflib.SequenceMatcher` may allow significant speed gains, due to the ability to cache the `suggest` search term
  - For v2.3, perhaps keep the current single-process fuzzywuzzy method as the default, but implement a feature flag to switch to the mp-accelerated version?
- A naive, direct use of `difflib.SequenceMatcher` does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default `WRatio`-based partial matcher in fuzzywuzzy
  - Going to want to port this fuzzywuzzy method directly into the codebase, adapted to allow for multiprocessing, I think.
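On the caching point: per the `difflib` docs, `SequenceMatcher` computes and caches detailed information about its *second* sequence, so comparing one search term against many candidates is fastest when the term is set once via `set_seq2()` and only the candidate is swapped in via `set_seq1()`. A minimal sketch of that pattern (the 0.6 threshold and `rank()` name are illustrative, not anything from the codebase):

```python
from difflib import SequenceMatcher


def rank(names, search, thresh=0.6):
    """Rank `names` by similarity to `search`, reusing one matcher."""
    sm = SequenceMatcher()
    # seq2 is the sequence SequenceMatcher caches info about, so set
    # the constant search term there exactly once.
    sm.set_seq2(search)
    scored = []
    for name in names:
        sm.set_seq1(name)  # only the cheap side changes per candidate
        # quick_ratio() is an inexpensive upper bound on ratio();
        # use it to short-circuit obvious non-matches.
        if sm.quick_ratio() >= thresh:
            ratio = sm.ratio()
            if ratio >= thresh:
                scored.append((ratio, name))
    return [name for ratio, name in sorted(scored, reverse=True)]
```

As noted above, a raw `ratio()` score like this is not a drop-in replacement for fuzzywuzzy's `WRatio`-based partial matching; the caching pattern would need to be combined with a ported version of that scoring logic.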