Add optional spacy_model argument to compute_lda_model in order to avoid expensive spacy model reloads

Open alessandrodd opened this issue 3 years ago • 0 comments

Currently, the compute_lda_model function calls LoadFile.load_document N times, once for each document in the dataset. load_document accepts an optional spacy_model argument, but compute_lda_model never passes one.

As a result, load_document in turn calls RawTextReader.read without a spacy_model argument, so one of the installed spacy models is loaded from scratch on every call, N times in total. This makes the whole LDA model computation significantly slower than necessary.

In my commit, I've added an optional spacy_model argument to the compute_lda_model function, which gets passed down to the load_document -> read chain, as well as a couple of additional logging lines to avoid giving the impression that the code is stuck.
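To illustrate the idea, here is a minimal, self-contained sketch of the "load once, pass it down" pattern. It is not the actual pke code: FakeModel, load_model, read, load_document and compute_lda_model_sketch are hypothetical stand-ins that only mirror the call chain described above, and the LOADS counter exists purely to make the number of model loads visible.

```python
from typing import Iterable, List, Optional

# Counts how many times the (simulated) expensive model load happens.
LOADS = {"count": 0}


class FakeModel:
    """Hypothetical stand-in for a loaded spaCy pipeline."""

    def __call__(self, text: str) -> List[str]:
        # A real pipeline would tokenize, tag, etc.; here we only split.
        return text.split()


def load_model() -> FakeModel:
    """Simulates the expensive 'load a spaCy model from scratch' step."""
    LOADS["count"] += 1
    return FakeModel()


def read(text: str, spacy_model: Optional[FakeModel] = None) -> List[str]:
    """Stand-in for the reader: loads a model only if none was given."""
    if spacy_model is None:
        spacy_model = load_model()
    return spacy_model(text)


def load_document(text: str, spacy_model: Optional[FakeModel] = None) -> List[str]:
    """Stand-in for load_document: forwards the optional model to read."""
    return read(text, spacy_model=spacy_model)


def compute_lda_model_sketch(
    documents: Iterable[str], spacy_model: Optional[FakeModel] = None
) -> List[List[str]]:
    """Processes every document, forwarding the optional model down the chain."""
    return [load_document(doc, spacy_model=spacy_model) for doc in documents]


if __name__ == "__main__":
    docs = ["first document", "second document", "third one"]

    compute_lda_model_sketch(docs)  # no model passed: one load per document
    print("loads without shared model:", LOADS["count"])

    LOADS["count"] = 0
    shared = load_model()  # load once up front
    compute_lda_model_sketch(docs, spacy_model=shared)
    print("loads with shared model:", LOADS["count"])
```

With real spaCy, the shared model would presumably be created once by the caller (for example with spacy.load and the name of an installed model) and then passed in through the new argument. Keeping the argument optional means existing callers that do not pass a model keep working as before.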

On my machine, with a dataset of 14k documents, the LDA model computation time goes from ~2.5 days to ~20 minutes.

May 14 '22 08:05 alessandrodd