Codecov Report
@@            Coverage Diff             @@
##           master     #527      +/-   ##
==========================================
+ Coverage   99.49%   99.50%   +0.01%
==========================================
  Files          80       82       +2
  Lines        5340     5458     +118
==========================================
+ Hits         5313     5431     +118
  Misses         27       27
|
It's working now (at least for the tfidf backend) but it is pretty slow: at least an order of magnitude slower than the Snowball analyzer. Some batching is probably needed to make it more efficient, but that requires changes to the Analyzer API as well as to the individual backends.
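The batching idea could look roughly like the sketch below. It assumes a loaded spaCy pipeline object and relies only on its `pipe()` method, which processes texts in batches. The function name `lemmatize_batch`, the default `batch_size` and the alphabetic-token filter are hypothetical illustrations, not part of Annif's Analyzer API.

```python
from typing import Iterable, Iterator, List


def lemmatize_batch(nlp, texts: Iterable[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield one list of lemmas per input text.

    `nlp` is assumed to be a loaded spaCy pipeline (e.g. the result of
    spacy.load(...)); only its pipe() method is used here. The function
    name, batch size and token filter are illustrative assumptions.
    """
    # nlp.pipe() streams the texts through the pipeline in batches, which
    # avoids the overhead of calling nlp(text) once per document.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # Keep only alphabetic tokens; a real analyzer may filter differently.
        yield [token.lemma_ for token in doc if token.is_alpha]
```

With spaCy and a model installed, this could be called as `list(lemmatize_batch(spacy.load("en_core_web_sm"), texts))`. How much of the slowdown batching actually recovers would need to be measured.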
|
Kudos, SonarCloud Quality Gate passed!
|
Force-pushed: f3cf264 to ffab9ea
|
Rebased and force-pushed. There is a new release of spaCy (3.2.0) available, which should be tested.
|
Force-pushed: f4c8ea9 to 10463f1
|
Rebased on current master, fixed conflicts and force-pushed.
… image via build arg
|
Ready for review! Things I'm a bit unsure about:
|
Works with YAKE, and the spaCy analyzer also somewhat improved evaluation results compared to the Snowball analyzer (on the JYU test set, F1@5 0.1706 -> 0.1870). The Dockerfile looks good to me. (The three retries for timeouts when downloading NLTK data were originally added for builds run in Drone, as there were some network problems in Drone at that time, but I think the situation has improved now.) Just one point to consider: if the spaCy model has not been loaded, a rather lengthy traceback is shown ending
|
I compared this to Snowball using the Annif-tutorial yso-nlf data sets and the three backend configurations (tfidf, mllm, omikuji-parabel) used in the tutorial. Results (best score for each backend type highlighted):
Observations:
The point of these experiments was to check that the analyzer works reasonably well with those backends, not that the results are necessarily better in terms of F1 scores etc. spaCy has other advantages, especially the many languages it supports.
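For readers unfamiliar with the F1@5 figures quoted in this thread, here is a minimal sketch of how such a score can be computed. It assumes the metric is a per-document F1 over the five top-ranked suggestions, averaged over documents. The function names are hypothetical and Annif's own evaluation code may differ in details.

```python
def f1_at_k(ranked, gold, k=5):
    """F1 score of the top-k ranked suggestions against the gold subjects."""
    top = set(ranked[:k])
    gold = set(gold)
    true_positives = len(top & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(top)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(all_ranked, all_gold, k=5):
    """Average the per-document F1@k over all documents (assumed averaging)."""
    scores = [f1_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a document with three gold subjects of which two appear among the top five suggestions scores 2·2/(5+3) = 0.5.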
|
I also tested svc and fasttext backends using the 20news data set in Annif-corpora. Results:
Observations:
I'd say this is good enough. I will check a few final things (including the error shown when a model doesn't exist, thanks @juhoinkinen!) and then merge this.
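One possible way to shorten the error shown for a missing model is sketched below. It relies on `spacy.load` raising `OSError` when the named model is not installed. The exception class `ModelNotFoundError`, the function name and the `loader` parameter are hypothetical and only illustrate the idea; they are not taken from this PR.

```python
class ModelNotFoundError(Exception):
    """Hypothetical error type carrying a short, user-facing message."""


def load_spacy_model(name, loader=None):
    """Load a spaCy pipeline, replacing the long traceback with a short hint.

    `loader` defaults to spacy.load; it is a parameter only so that the
    function can be exercised without spaCy installed (an assumption of
    this sketch).
    """
    if loader is None:
        import spacy  # imported lazily so spaCy stays an optional dependency
        loader = spacy.load
    try:
        return loader(name)
    except OSError:
        # spacy.load raises OSError when the named model is not installed;
        # "from None" hides the original traceback from the user.
        raise ModelNotFoundError(
            f"spaCy model '{name}' not found; "
            f"try: python -m spacy download {name}"
        ) from None
```

A caller could then catch `ModelNotFoundError` and print only its message instead of the full traceback.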
|








Initial draft PR of a new (optional) spaCy-based analyzer.
Fixes #374
TODO items:
Test with Swedish (which doesn't have a complete pretrained model) and adapt the code as necessary (out of scope for now)