
Implement multiprocessing speedup for suggest in absence of python-Levenshtein #178

@bskinn

Description

NOTE: With the deprecation of the python-Levenshtein speedup for the suggest functionality (see #211 & #218), identifying other methods to increase performance is a priority. This multiprocessing-based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.


Very early POC on suggest-multiproc branch.

The timings suggest some speedup is possible, but it's not a slam dunk that it's worth it, given the sub-2 s processing times for most inventories. Properly exploiting difflib.SequenceMatcher's caching behavior may change this, however.
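
For reference, a minimal sketch of the caching behavior in question (the function name ratios_for_term is illustrative only, not part of sphobjinv):

```python
# Illustrative only -- not sphobjinv code. Per the difflib docs,
# SequenceMatcher caches detailed statistics about its *second* sequence,
# so fixing the search term as seq2 and varying only seq1 avoids
# recomputing that work for every comparison.
import difflib

def ratios_for_term(search, names):
    """Score every inventory name against one fixed search term."""
    sm = difflib.SequenceMatcher()
    sm.set_seq2(search)  # seq2 statistics are computed once, reused below
    scores = {}
    for name in names:
        sm.set_seq1(name)  # only the seq1 side changes per comparison
        scores[name] = sm.ratio()
    return scores
```

This is exactly the "compare one sequence against many" pattern the difflib documentation recommends set_seq2() for.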

For comparison, python-Levenshtein provides a better speedup with less internal complexity, and doesn't appear to benefit at all from multiprocessing.

Latitude laptop (4 cores), yt (17420 objects)

Quick, rough timing:

>>> import sphobjinv as soi
>>> import timeit
>>> timeit.timeit("soi.Inventory('tests/resource/objects_yt.inv').suggest('ndarray')", globals=globals(), number=1)

NO mp,      WITH lev:  2.177 s
WITH mp(4), WITH lev:  2.396 s
WITH mp(6), WITH lev:  2.355 s

NO mp,      NO lev:   11.795 s
WITH mp(2), NO lev:   10.361 s
WITH mp(3), NO lev:    8.471 s
WITH mp(4), NO lev:    7.583 s (~35% speedup)
WITH mp(5), NO lev:    7.399 s (oversubscribed)
WITH mp(6), NO lev:    8.372 s (oversubscribed)
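
Roughly the shape of the experiment, as a hypothetical sketch (score_chunk and suggest_mp are illustrative names, not the actual branch code): split the inventory names into one chunk per worker, score the chunks in a process pool, and skip the pool entirely when nproc == 1.

```python
# Hypothetical sketch of the chunked-scoring approach being timed above;
# not sphobjinv API and not the suggest-multiproc branch code.
import difflib
import multiprocessing

def score_chunk(args):
    """Score one chunk of names against the search term (runs in a worker)."""
    search, names = args
    sm = difflib.SequenceMatcher()
    sm.set_seq2(search)  # cache the search term's statistics per worker
    scored = []
    for name in names:
        sm.set_seq1(name)
        scored.append((round(100 * sm.ratio()), name))
    return scored

def suggest_mp(search, names, nproc=None):
    """Return (score, name) pairs, best first; nproc=1 skips multiprocessing."""
    nproc = nproc or multiprocessing.cpu_count()
    if nproc == 1:
        chunks_scored = [score_chunk((search, names))]
    else:
        chunks = [(search, names[i::nproc]) for i in range(nproc)]
        with multiprocessing.Pool(nproc) as pool:
            chunks_scored = pool.map(score_chunk, chunks)
    return sorted((pair for chunk in chunks_scored for pair in chunk),
                  reverse=True)
```

Per the first note below, the context manager around the Pool could likely be dropped in favor of a bare pool.map() call.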

Notes

  • Probably can switch to pool.map(), without the context manager
  • Looks like using multiprocessing.cpu_count() would be a reasonable default pool size
  • Want the user to be able to set a maximum process count (probably refuse to oversubscribe??)
    • API -- arg to Inventory.suggest(), likely
    • CLI -- new parameter for suggest subparser
  • If nproc == 1 then skip multiprocessing entirely
  • Probably need to check at sphobjinv import-time whether multiprocessing is available
    • Doing that in a way that doesn't further slow down the import time may be tricky
  • Per the difflib docs, implementing a bespoke scoring function directly with difflib.SequenceMatcher may allow significant speed gains, due to the ability to cache the suggest search term
  • For v2.3, perhaps keep the current single-process fuzzywuzzy method as the default, but implement with a feature flag to switch to the mp-accelerated version?
  • A naive, direct use of difflib.SequenceMatcher does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default, WRatio-based partial-matcher in fuzzywuzzy
    • Going to want to port this fuzzywuzzy method directly into the codebase, adapted to allow for multiprocessing, I think.
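
To make the last point concrete, here is an illustrative partial-match scorer (NOT fuzzywuzzy's actual implementation, which chooses candidate windows via get_matching_blocks rather than brute-force sliding): scoring the term against the best same-length window of each name, instead of the whole string, is what rescues short search terms.

```python
# Illustrative sketch only -- shows why a partial-match (WRatio-style)
# scorer behaves better for short terms than naive whole-string matching.
import difflib

def plain_ratio(a, b):
    """Whole-string similarity, as a naive SequenceMatcher use would give."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def partial_ratio(term, name):
    """Best similarity between the shorter string and any same-length
    window of the longer one (brute force, for illustration)."""
    short, long_ = sorted((term, name), key=len)
    return max(
        plain_ratio(short, long_[i:i + len(short)])
        for i in range(len(long_) - len(short) + 1)
    )
```

For example, against a made-up name like "sphobjinv.dataobjstr", the whole-string ratio for "dataobj" is only about 0.52, while the windowed score finds the exact substring and returns 1.0.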
