MemoryError when pruning vectors with small batch size #2976

@ALSchwalm

Description

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_lg')
nlp.vocab.prune_vectors(500000, 100)

Passing batch_size=100 in the call above should keep memory usage bounded during pruning. However, there is a bug in vocab.pyx: the batch_size parameter is never passed on to the most_similar call (in fact, it is not used at all), so the similarity computation operates over the full set of vectors at once, producing a very large batch matrix when nr_row is large.
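For illustration, here is a minimal NumPy sketch of what forwarding batch_size is meant to achieve. This is not spaCy's actual implementation, and most_similar_batched is a hypothetical helper; it only shows how chunking the queries caps the size of the intermediate similarity matrix at (batch_size, nr_row) rather than (n_queries, nr_row):

```python
import numpy as np

def most_similar_batched(queries, vectors, batch_size=100):
    """Return, for each query row, the index of the most similar row in
    `vectors`, processing queries in chunks of `batch_size` so the
    intermediate similarity matrix never exceeds (batch_size, nr_row)."""
    # Normalise once so a dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    best = np.empty(len(queries), dtype=np.int64)
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        sims = batch @ vectors.T  # at most (batch_size, nr_row)
        best[start:start + batch_size] = sims.argmax(axis=1)
    return best
```

With the bug, the equivalent of `queries @ vectors.T` is materialised for all queries at once, which is what exhausts memory for large vocabularies.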

I can submit a PR to fix this, but I am not clear what an appropriate test would look like. I could create a large vocabulary and prune it to a slightly smaller size using a small batch size, but it would take a very long time to run, and would not necessarily fail even without the fix (if the machine running the test had lots of RAM, for example). Any advice there would be appreciated.
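One way to test the fix without a large vocabulary or memory assertions would be to spy on the call and check that batch_size is actually forwarded. The sketch below is hypothetical and uses stand-in objects rather than spaCy's real classes (the Vectors stub and prune_vectors here are illustrations of the pattern only):

```python
from unittest import mock

class Vectors:
    """Stand-in for the real vectors object; only the method we spy on."""
    def most_similar(self, queries, batch_size=1024):
        return None

def prune_vectors(vectors, queries, batch_size):
    # The fix under discussion: forward batch_size to most_similar.
    return vectors.most_similar(queries, batch_size=batch_size)

def test_prune_forwards_batch_size():
    vectors = Vectors()
    with mock.patch.object(vectors, "most_similar",
                           wraps=vectors.most_similar) as spy:
        prune_vectors(vectors, queries=[], batch_size=7)
    # The test passes iff batch_size reached the callee unchanged.
    spy.assert_called_once()
    assert spy.call_args.kwargs["batch_size"] == 7

test_prune_forwards_batch_size()
```

A test of this shape runs quickly and fails deterministically regardless of how much RAM the test machine has.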

Your Environment

  • spaCy version: 2.0.16
  • Platform: Linux-4.18.16-arch1-1-ARCH-x86_64-with-arch
  • Python version: 3.7.1
  • Models: en_core_web_lg

Metadata


Labels

  • bug (Bugs and behaviour differing from documentation)
  • feat / vectors (Feature: Word vectors and similarity)
