MemoryError when pruning vectors with small batch size #2976
Closed
Labels
bug (Bugs and behaviour differing from documentation), feat / vectors (Feature: Word vectors and similarity)
Description
How to reproduce the behaviour
```python
import spacy

nlp = spacy.load('en_core_web_lg')
nlp.vocab.prune_vectors(500000, 100)
```
Passing the `batch_size` parameter in the call above should keep memory usage bounded. However, there is a bug in `vocab.pyx`: the `batch_size` parameter is never passed on to the `most_similar` call (in fact, it is not used at all), so when `nr_row` is large, the batch matrix becomes very large and can exhaust memory.
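To illustrate the intended behaviour, here is a minimal sketch (not spaCy's actual implementation; `most_similar_batched` is a hypothetical helper) of how a batched nearest-neighbour lookup keeps the similarity matrix bounded by the batch size instead of materialising all `nr_row` rows at once:

```python
import numpy as np

def most_similar_batched(queries, vectors, batch_size=64):
    """Return, for each query row, the index of the most similar
    vector (by cosine similarity), processing queries in batches so
    the intermediate similarity matrix never has more than
    `batch_size` rows."""
    # Normalise rows so dot products equal cosine similarities.
    vecs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    best = np.zeros(queries.shape[0], dtype=np.int64)
    for start in range(0, queries.shape[0], batch_size):
        batch = queries[start:start + batch_size]
        batch = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        # Shape is (<= batch_size, n_vectors), not (n_queries, n_vectors).
        sims = batch @ vecs.T
        best[start:start + batch_size] = sims.argmax(axis=1)
    return best
```

The bug described here is essentially that the equivalent of `batch_size` above is never forwarded, so the loop effectively runs as one giant batch.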
I can submit a PR to fix this, but I am not sure what an appropriate test would look like. I could create a large vocabulary and prune it to a slightly smaller size with a small batch size, but that would take a very long time to run, and it would not necessarily fail even without the fix (for example, if the machine running the test had plenty of RAM). Any advice there would be appreciated.
Your Environment
- spaCy version: 2.0.16
- Platform: Linux-4.18.16-arch1-1-ARCH-x86_64-with-arch
- Python version: 3.7.1
- Models: en_core_web_lg