Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #926      +/-   ##
==========================================
- Coverage   99.63%   99.63%   -0.01%
==========================================
  Files         103      103
  Lines        8238     8225      -13
==========================================
- Hits         8208     8195      -13
  Misses         30       30
```
Selecting the PyTorch variant (CPU or CUDA x.y or ROCm or ...) when setting up the development environment with uv is not straightforward. The problem is that uv normally resolves a single universal dependency tree, so only one variant of the torch dependency could be chosen. But fortunately, it is possible to have some degree of control over the resolution by setting up "extras" and then declaring a "conflict" between them. This causes uv to "fork" the resolution into different "branches", each having their own dependency tree. So in commit e629963, I added two new extras: `torch-cpu` and `torch-cu128`. The end result is that these two extras can be used to select the PyTorch variant at installation time. For example:

1. `uv sync --group all --extra torch-cpu` installs the CPU-only variant
2. `uv sync --group all --extra torch-cu128` installs the CUDA 12.8 variant
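For reference, a hypothetical sketch of what this extras/conflicts mechanism looks like in `pyproject.toml` (the index names and URLs here are illustrative assumptions based on uv's documented PyTorch integration, not necessarily Annif's actual configuration):

```toml
[project.optional-dependencies]
torch-cpu = ["torch"]
torch-cu128 = ["torch"]

[tool.uv]
# Declaring the extras as conflicting makes uv fork the resolution,
# so each extra gets its own dependency tree.
conflicts = [
  [{ extra = "torch-cpu" }, { extra = "torch-cu128" }],
]

[tool.uv.sources]
# Each extra pulls torch from a different PyTorch index.
torch = [
  { index = "pytorch-cpu", extra = "torch-cpu" },
  { index = "pytorch-cu128", extra = "torch-cu128" },
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```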
I refined the above solution by adding another extra. Maybe not ideal, but it works.
I ran benchmarking runs using the Annif-tutorial YSO-NLF dataset on the annif-data-kk server (it has 6 CPUs). The script that was used and the output data are in the benchmarking branch.

(train and eval result tables omitted)

Compared to the TensorFlow implementation, PyTorch requires twice as much memory in training and is slightly slower (107% in user time); but in inference the situation is the opposite: PyTorch is faster (~98% user time) and takes less memory.
Thanks @juhoinkinen! The RAM usage doubling is interesting. First hypothesis: maybe PyTorch uses higher precision floats than TensorFlow? I'll investigate.
I've now implemented all the changes I had in mind. The model has been changed to a much smaller variant that seems to perform better than the original TensorFlow based model (which, it turns out, was heavily overengineered; my bad!). @juhoinkinen @mjsuhonos @mfakaehler Please try this out if you have a chance! If there are no big problems, I think this could soon be merged.
Using the code before today's commits, I ran the same benchmarks as above; full output here. (The early stopping is a very nice feature! The decrease of nDCG stopped training after epoch 23 (with -j1) and 22 (with -j6).)

(train and eval result tables omitted)
juhoinkinen left a comment:

I read through the code and have no complaints. :)
There could be a short explanation of the model architecture somewhere, maybe in the NN ensemble Wiki page.
We could try out whether online learning works better with the new implementation.
@juhoinkinen Thanks for the new benchmark and the approval! It's a pity that nDCG seems to have decreased. With the Finto AI (Finnish) training data set that I used for most tests, there was a nice increase in F1@5 scores of more than 0.02. You reported that the model size is almost unchanged, but this number seems to include the LMDB, which contains all the preprocessed training data. The actual model file itself is much smaller.
I did some further testing on YSO and KOKO based projects. Sadly, the evaluation results have in most cases decreased compared to the old NN ensemble (around 0.02 in F1@5 scores). I think the model architecture still needs some work; the current one may be too simple after all, even if it performs better than the old one in certain cases (e.g. JYU-theses/fi). Also, the LMDB size now seems to grow faster than earlier. All the KOKO models required more than 2GB of disk space, whereas previously they didn't hit the default 1GB limit. This needs to be investigated; it was not intended.
Sandro and I have tested the new nn-ensemble backend. As a small disclaimer ahead: even with the old implementation, we have not yet found a configuration that has produced better results than the plain ensemble. Maybe the GND with its >200K entities is too hard a problem for the architecture. We tested the following config with the new and the old branches and trained the ensemble on ~250K tables of content.

projects.cfg: (configuration omitted)

While the TensorFlow implementation ran through and produced reasonable results, the PyTorch version had serious trouble with the LMDB and aborted during document processing with the following error:

`lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached`

We restarted the process, increasing the LMDB size to 70GB and then 400GB, both runs eventually aborting. Only after restricting the training documents to 25K (1/10 of the original) did the actual PyTorch training start, and the process could finish. The final size of the resulting file is given in the results below. We will report any other observations later.
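For anyone reproducing this: the LMDB size limit can be raised per project. A sketch of the relevant `projects.cfg` fragment, assuming Annif's `lmdb_map_size` backend parameter (the project name, sources and the 70GB value here are illustrative, not the actual DNB configuration):

```ini
; Illustrative projects.cfg fragment for an nn_ensemble project.
; lmdb_map_size is given in bytes; 70000000000 is roughly 70 GB.
[gnd-nn-ensemble-de]
name=GND NN ensemble German
backend=nn_ensemble
sources=project-a,project-b
limit=100
lmdb_map_size=70000000000
```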
Here are the further results from a test that @mfakaehler and I carried out. Since the training data set of 260K could not be used (see above), we limited the training data to 25K.

Settings

Test case settings: single models, nn-ensemble parameters for nn classic (Keras/TensorFlow) and nn new (PyTorch), and technical settings (details omitted).

Results

Below are some observations and analysis of the results. Information on memory and CPU usage can be found in the graphs and figures in Appendix 926-test-nn-cpu-memory.pdf.

train (25,000 tocs)

The mdb-file sizes differ enormously: despite the same training volume, the new nn PyTorch database grows many times faster than nn classic's. The model files show quite the opposite behaviour.

Note: the new nn PyTorch ensemble stopped after the 6th epoch due to the early stopping functionality (log message omitted).

A suggestion for the early stopping feature: depending on the size of the vocabulary and, in particular, the training data set, it might be useful to have an option to individually specify the size of the random subset used (currently n=512 documents). In the DNB test case with a training data set of 25,000 documents, a higher number (e.g., 20% of the total) could be useful and lead to more robust statements. So you could either set EVAL_BATCH_SIZE proportional to the training data size, or make it configurable with the annif train call.

eval (40,885 tocs)

The new nn PyTorch is 2 minutes faster in evaluation than nn classic. There is hardly any difference in performance: nn PyTorch and nn classic achieve (almost) identical values for F1@5. For F1@10, nn classic is slightly ahead, while the new nn PyTorch offers a slightly more optimized ranking for the top 10 suggestions (see NDCG@10).

index (40,885 tocs)

nn classic requires 0.097 seconds per document. The new nn PyTorch requires 0.156 seconds per document.
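The EVAL_BATCH_SIZE suggestion above could be sketched as follows. This is a hypothetical helper, not Annif's actual code; `fixed_size=512` mirrors the current fixed default mentioned in the thread.

```python
import random

def eval_subset(doc_ids, fixed_size=512, fraction=None, seed=42):
    """Pick the random subset used for the early-stopping nDCG metric.

    If `fraction` is given, the subset size is proportional to the
    training set size instead of the fixed default.
    """
    size = int(len(doc_ids) * fraction) if fraction is not None else fixed_size
    size = min(size, len(doc_ids))  # never ask for more docs than exist
    return random.Random(seed).sample(doc_ids, size)

docs = list(range(25_000))  # e.g. the 25K DNB training documents
print(len(eval_subset(docs)))                # 512 (current behaviour)
print(len(eval_subset(docs, fraction=0.2)))  # 5000 (20% of the training set)
```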
Thanks a lot for the detailed report @san-uh! It is clear that there is something wrong with the LMDB database (nn-train.mdb) growing much faster than before. That obviously needs fixing. After that, I will try to rework the model architecture to try to achieve better results with at least some of our data sets (e.g. KOKO based) where the outcome was much worse than for the old NN ensemble. Regarding the early stopping heuristic and the set of 512 documents that is used to calculate the metric: you are right that this is a bit rigid. However, in my tests it didn't seem to matter that much; it might happen that choosing a non-ideal subset would cause the early stopping to happen one epoch earlier or later than ideal, but it should be pretty close to optimal regardless. Still, I can try to make this more flexible.
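For readers following along, the early-stopping heuristic described in this thread (track a metric per epoch, stop after it fails to improve for patience=2 epochs) can be sketched like this. This is a minimal illustration, not Annif's actual training loop, and the scores are made up.

```python
def train_with_early_stopping(epoch_scores, patience=2):
    """Return the epoch (1-based) after which training would stop.

    epoch_scores: the nDCG measured on the evaluation sample after
    each epoch. Training stops once the score has not improved for
    `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(epoch_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(epoch_scores)  # ran through all epochs without triggering

# Made-up scores that peak at epoch 4; with patience=2, training stops
# after epoch 6.
print(train_with_early_stopping([0.60, 0.65, 0.68, 0.70, 0.69, 0.66, 0.64]))
```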
It took a while to hunt this down, but I found the cause of the massive increase in LMDB size. The code was using a suboptimal sparse matrix type (CSC instead of CSR). CSC used to be the right choice in the old code, but the new code flips some matrix dimensions around, and now CSR is needed. The most recent commit 211fde9 fixes this. The next step is further iteration on the model itself; I think some alternative approaches must be evaluated, as the current one may be too simple.
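As an illustration of how the CSC/CSR choice can blow up storage for wide matrices (this is a general scipy.sparse property, not Annif's actual code): for a 1 × n_labels vector, CSC keeps an index pointer entry per column while CSR keeps one per row, so the CSC serialization carries an index array the size of the whole vocabulary.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

n_labels = 100_000  # e.g. the size of a large vocabulary
vec = np.zeros((1, n_labels), dtype=np.float32)
vec[0, [5, 42, 999]] = 1.0  # only three nonzero scores

csr = csr_matrix(vec)
csc = csc_matrix(vec)

# indptr has one entry per row (CSR) or per column (CSC), plus one.
print(len(csr.indptr))  # 2
print(len(csc.indptr))  # 100001
```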



This PR reimplements the NN ensemble using PyTorch instead of Keras/TensorFlow.

To test this, you will have to use `uv sync --group all --extra torch-cpu` or similar (see comments below).

Some notes about the implementation:

- The old implementation used the `top_k_categorical_accuracy` metric, but this was not easily available in PyTorch, so I switched to the nDCG metric, computed for a random subset (n=512 documents) of the given train set; this metric is used for early stopping.
- Training runs for at most the given number of epochs (`max-epochs` parameter), but tracks nDCG on a small sample (n=512) of the train set and stops when scores start to decline (with patience=2).
- I added a dependency group `all` for installing all extras (a substitute for `--all-extras`, which won't work anymore) as well as special extras for selecting the PyTorch variant. I only defined `torch-cpu` and `torch-cu128` extras for now, but I think the setup could quite easily be extended to other PyTorch variants such as CUDA 12.6 or 13.0, ROCm or Intel XPU, though obviously these would require more configuration in `pyproject.toml`.

Fixes #895