NOTE: With the deprecation of the `python-Levenshtein` speedup for the `suggest` functionality (see #211 & #218), identifying other ways to increase performance is a priority. This multiprocessing-based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.
Very early POC on the `suggest-multiproc` branch.

It suggests some speedup is possible, but it's not a slam dunk whether it's worth it, given the sub-2 s processing times for most inventories. Properly exploiting `difflib.SequenceMatcher`'s caching behavior may change this, however.

For comparison, `python-Levenshtein` is a better speedup for less internal complexity, and doesn't appear to benefit at all from multiprocessing.
Latitude laptop (4 cores), yt (17420 objects)
Quick, rough timing:
>>> import sphobjinv as soi
>>> import timeit
>>> timeit.timeit("soi.Inventory('tests/resource/objects_yt.inv').suggest('ndarray')", globals=globals(), number=1)
NO mp, WITH lev: 2.177 s
WITH mp(4), WITH lev: 2.396 s
WITH mp(6), WITH lev: 2.355 s
NO mp, NO lev: 11.795 s
WITH mp(2), NO lev: 10.361 s
WITH mp(3), NO lev: 8.471 s
WITH mp(4), NO lev: 7.583 s (~35% speedup)
WITH mp(5), NO lev: 7.399 s (oversubscribed)
WITH mp(6), NO lev: 8.372 s (oversubscribed)
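For reference, the multiprocessing approach being timed above looks roughly like the sketch below. `score_one` and `suggest_scores` are hypothetical helper names (not sphobjinv API), and `difflib.SequenceMatcher.ratio()` stands in here for the fuzzywuzzy `WRatio` scorer actually used:

```python
"""Rough sketch of a multiprocessing scoring pass for suggest.

Assumptions (not sphobjinv API): `score_one` and `suggest_scores` are
hypothetical helpers, and SequenceMatcher.ratio() is a stand-in scorer.
"""
import difflib
import multiprocessing as mp
from functools import partial


def score_one(search, candidate):
    """Score one candidate name against the search term, 0-100."""
    return round(100 * difflib.SequenceMatcher(None, search, candidate).ratio())


def suggest_scores(search, candidates, nproc=None):
    """Score all candidates, fanning out to a process pool when nproc > 1."""
    nproc = nproc or mp.cpu_count()  # cpu_count() as the default pool size
    scorer = partial(score_one, search)

    if nproc == 1:
        # Skip multiprocessing entirely for a single process
        return list(map(scorer, candidates))

    with mp.Pool(processes=nproc) as pool:
        return pool.map(scorer, candidates)


if __name__ == "__main__":
    names = ["numpy.ndarray", "pandas.DataFrame", "ndarray_like"]
    print(suggest_scores("ndarray", names, nproc=2))
```

The `partial` binding keeps the search term fixed so `pool.map()` only has to ship each candidate string to the workers.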
Notes

- Probably can switch to `pool.map()`, without the context manager
- Looks like using `multiprocessing.cpu_count()` would be a reasonable default pool size
- Want the user to be able to set a maximum process count (probably refuse to oversubscribe??)
- API -- arg to `Inventory.suggest()`, likely
- CLI -- new parameter for the `suggest` subparser
- If `nproc == 1`, then skip multiprocessing entirely
- Probably need to check at `sphobjinv` import time whether `multiprocessing` is available
- Doing that in a way that doesn't further slow down the import time may be tricky
- Per the `difflib` docs, implementing a bespoke scoring function directly with `difflib.SequenceMatcher` may allow significant speed gains, due to the ability to cache the `suggest` search term
- For v2.3, perhaps keep the current single-process fuzzywuzzy method as the default, but add a feature flag to switch to the mp-accelerated version?
- A naive, direct use of `difflib.SequenceMatcher` does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default `WRatio`-based partial matcher in fuzzywuzzy
- Going to want to port this fuzzywuzzy method directly into the codebase, adapted to allow for multiprocessing, I think.
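The caching behavior mentioned in the `difflib` docs looks roughly like this: `SequenceMatcher` caches detailed information about its *second* sequence, so keeping the search term fixed as `seq2` and swapping each inventory name in as `seq1` avoids redoing that work per candidate. A sketch only; `cached_scores` is a hypothetical name, and as noted above, raw `ratio()` scores are not a drop-in replacement for `WRatio`:

```python
"""Sketch of exploiting SequenceMatcher's seq2 caching for suggest scoring.

`cached_scores` is a hypothetical helper, not sphobjinv API.
"""
import difflib


def cached_scores(search, candidates):
    """Score candidates 0-100 against a fixed search term."""
    sm = difflib.SequenceMatcher()
    sm.set_seq2(search)  # seq2 info is computed once and cached

    scores = []
    for name in candidates:
        sm.set_seq1(name)  # only seq1 changes per candidate
        scores.append(round(100 * sm.ratio()))
    return scores
```

Whether this beats the per-call overhead of constructing a fresh matcher for 17k+ objects would need to be measured against the timings above.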