
Reimplement NN ensemble using PyTorch #926

Open

osma wants to merge 42 commits into main from issue895-nn-ensemble-pytorch

Conversation

Member

@osma osma commented Jan 13, 2026

This PR reimplements the NN ensemble using PyTorch instead of Keras/TensorFlow.

To test this, you will have to use `uv sync --group all --extra torch-cpu` or similar (see comments below).

Some notes about the implementation:

  • the neural network architecture has been radically simplified; it turned out that a much simpler model (separate linear models for each concept) gives better results than the old MLP-based model
  • the old code displayed top_k_categorical_accuracy, but this metric is not readily available in PyTorch, so I switched to nDCG, computed on a random subset (n=512 documents) of the given train set; this metric is also used for early stopping
  • the progress bar shown during training now uses tqdm, so it looks a bit different from the Keras one; it is also displayed on stderr, not stdout as the old one was
  • the code implements early stopping: training can run up to 20 epochs (configurable with the max-epochs parameter), but nDCG is tracked on a small sample (n=512) of the train set and training stops when scores start to decline (with patience=2)
  • the old code showed a detailed error message when model loading failed; I couldn't figure out (yet) how to do that with PyTorch models, but the model is stored with metadata (Python version, torch version etc.) that may help in implementing such an error message later, should it turn out to be necessary. In general, the models should be pretty much PyTorch-version-agnostic, so there may be no need for this.
  • this PR sets up a dependency group all for installing all extras (a substitute for --all-extras, which won't work anymore) as well as special extras for selecting the PyTorch variant. I only defined torch-cpu and torch-cu128 extras for now, but the setup could quite easily be extended to other PyTorch variants such as CUDA 12.6 or 13.0, ROCm or Intel XPU, though these would obviously require more configuration in pyproject.toml.
  • this NN ensemble will not make use of a GPU anyway: the model is trained and inference is performed on CPU only, and the model is so small that GPU computation would bring no practical benefit. But the infrastructure for GPU use is now in place for other PyTorch-based backends, such as EBM or XTransformer, that would benefit from GPU computing.
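The PR itself contains the real implementation; as a rough illustration of what "separate linear models for each concept" can mean in PyTorch, here is a minimal sketch (class and dimension names are hypothetical, not taken from the PR):

```python
import torch
from torch import nn


class PerConceptLinear(nn.Module):
    """Hypothetical sketch: one independent linear model per concept.
    For concept j: output[j] = sum_i weight[i, j] * scores[i, j] + bias[j]."""

    def __init__(self, n_sources: int, n_concepts: int) -> None:
        super().__init__()
        # start from a uniform average of the source scores
        self.weight = nn.Parameter(
            torch.full((n_sources, n_concepts), 1.0 / n_sources))
        self.bias = nn.Parameter(torch.zeros(n_concepts))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources, n_concepts) -> (batch, n_concepts)
        return (scores * self.weight).sum(dim=1) + self.bias


model = PerConceptLinear(n_sources=4, n_concepts=10)
out = model(torch.rand(2, 4, 10))
print(out.shape)  # torch.Size([2, 10])
```

Unlike an MLP, this has no hidden layer and no interaction between concepts, which matches the "radically simplified" description above.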

Fixes #895

@osma osma self-assigned this Jan 13, 2026
@osma osma force-pushed the issue895-nn-ensemble-pytorch branch from 5bdbf64 to d82a54a Compare January 13, 2026 11:40

codecov bot commented Jan 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.63%. Comparing base (4ab42dc) to head (211fde9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #926      +/-   ##
==========================================
- Coverage   99.63%   99.63%   -0.01%     
==========================================
  Files         103      103              
  Lines        8238     8225      -13     
==========================================
- Hits         8208     8195      -13     
  Misses         30       30              


@osma osma force-pushed the issue895-nn-ensemble-pytorch branch from d82a54a to da479eb Compare January 15, 2026 12:25
Member Author

osma commented Jan 15, 2026

Selecting the PyTorch variant (CPU or CUDA x.y or ROCm or ...) when setting up the development environment using uv sync has been a headache, but I think I've found a workable solution. It's not super elegant, but at least it seems to work.

The problem is that uv sync wants to perform "universal resolution", that is, resolve all the transitive dependencies once and for all, then write the result into the uv.lock file. This can be parameterized by OS, Python version and some other factors, but not by anything that the user could set when running uv sync. Since different PyTorch variants have different dependencies (e.g. CUDA libraries), dependencies for each of them would have to be resolved separately.

But fortunately, it is possible to have some degree of control over the resolution by setting up "extras" and then declaring a "conflict" between them. This causes uv to "fork" the resolution into different "branches", each having their own dependency tree.

So in commit e629963, I added two new extras: torch-cpu (CPU only) and torch-cu128 (CUDA 12.8 GPU), and declared a conflict between them, i.e., you can't install both extras at the same time. (This will unfortunately cause --all-extras to stop working, which is a shame, since it means that lots of specific --extra parameters are needed in typical situations.) These extras are then tied to specific PyTorch package indexes and thus different variants of the torch package.

The end result is that these two extras can be used to select the PyTorch variant at uv sync time. The torch dependency is still also defined for the nn extra, without a specific index. This means that installing only the nn extra will install whatever is the default PyTorch variant (on Linux it is a CUDA variant).
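The extras-and-conflicts setup described above could look roughly like this in pyproject.toml (a sketch based on uv's documented PyTorch workflow; the index names and details here are illustrative, not the exact PR contents):

```toml
[project.optional-dependencies]
torch-cpu = ["torch"]
torch-cu128 = ["torch"]

[tool.uv]
# Forking the resolution: the two extras cannot be installed together.
conflicts = [
    [{ extra = "torch-cpu" }, { extra = "torch-cu128" }],
]

[tool.uv.sources]
torch = [
    { index = "pytorch-cpu", extra = "torch-cpu" },
    { index = "pytorch-cu128", extra = "torch-cu128" },
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```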

Here are examples of how this works now:

1. uv sync without extras

This installs 439MB of dependencies, no PyTorch.

$ uv sync
Resolved 212 packages in 1.71s
      Built annif @ file:///home/oisuomin/git/Annif
Prepared 1 package in 261ms
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.50ms
 ~ annif==1.5.0.dev0 (from file:///home/oisuomin/git/Annif)

$ du -sh .venv
439M	.venv

2. uv sync with just the nn extra

This installs the default PyTorch CUDA variant, for a total 2.2GB of dependencies.

$ uv sync --extra nn
Resolved 212 packages in 0.77ms
Installed 6 packages in 96ms
 + lmdb==1.7.5
 + mpmath==1.3.0
 + networkx==3.6.1
 + setuptools==80.9.0
 + sympy==1.14.0
 + torch==2.9.1

$ du -sh .venv
2.2G	.venv

3. uv sync with both nn and torch-cpu extras

This switches to the CPU-only variant of PyTorch. Dependencies are now only 1.2GB.

$ uv sync --extra nn --extra torch-cpu
Resolved 212 packages in 0.78ms
Uninstalled 1 package in 69ms
Installed 1 package in 93ms
 - torch==2.9.1
 + torch==2.9.1+cpu

$ du -sh .venv
1.2G	.venv

4. uv sync with both nn and torch-cu128 extras

This installs the PyTorch CUDA 12.8 variant and lots of nvidia-* library packages, for a whopping 7.0GB of dependencies. (I wonder why this isn't the same as the default PyTorch CUDA build that got installed in step 2 above?)

$ uv sync --extra nn --extra torch-cu128
Resolved 212 packages in 0.77ms
Uninstalled 1 package in 72ms
Installed 17 packages in 97ms
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.5
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.3.20
 + nvidia-nvtx-cu12==12.8.90
 - torch==2.9.1+cpu
 + torch==2.9.1+cu128
 + triton==3.5.1

$ du -sh .venv
7.0G	.venv

@osma osma mentioned this pull request Jan 15, 2026
Member Author

osma commented Jan 16, 2026

I refined the above solution by adding an all dependency group (because --all-extras cannot be used anymore). Now a basic developer install with all CPU-only extra features can be done with:

uv sync --group all --extra torch-cpu

Maybe not ideal, but it works.

@osma osma requested a review from juhoinkinen January 16, 2026 13:54
@osma osma added this to the 1.5 milestone Jan 16, 2026
@osma osma marked this pull request as ready for review January 16, 2026 15:54
@osma osma changed the title [WIP] Reimplement NN ensemble using PyTorch Reimplement NN ensemble using PyTorch Jan 16, 2026
Member

juhoinkinen commented Jan 22, 2026

I ran benchmarking runs using the Annif-tutorial YSO-NLF dataset on the annif-data-kk server (it has 6 CPUs).

The script used and the output data are in the benchmarking branch.

train

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
| --- | --- | --- | --- | --- |
| user time (seconds) | 2810.63 | 3023.01 | 2948.25 | 3208.04 |
| percent CPU | 106% | 112% | 571% | 538% |
| wall time | 44:26.96 | 45:36.19 | 8:45.21 | 10:10.04 |
| max RSS | 3_368_876 | 7_076_980 | 2_599_604 | 6_764_364 |
| model disk size (bytes) | 1_304_759_580 | 1_131_495_858 | (same as -j1) | (same as -j1) |

eval

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
| --- | --- | --- | --- | --- |
| user time | 475.29 | 471.15 | 485.92 | 473.70 |
| percent CPU | 99% | 99% | 498% | 507% |
| wall time | 7:58.65 | 7:53.83 | 1:38.66 | 1:34.24 |
| max RSS | 2_666_460 | 2_176_184 | 2_105_688 | 1_840_860 |
| nDCG | 0.4805 | 0.4750 | 0.4775 | 0.4691 |

Compared to the TensorFlow implementation, PyTorch requires twice as much memory in training and is slightly slower (107% of the user time); in inference the situation is the opposite: PyTorch is faster (~98% of the user time) and uses less memory.

Member Author

osma commented Jan 22, 2026

Thanks @juhoinkinen! The doubling of RAM usage is interesting. First hypothesis: maybe PyTorch uses higher-precision floats than TensorFlow? I'll investigate.

Member Author

osma commented Feb 11, 2026

I've now implemented all the changes I had in mind. The model has been changed to a much smaller variant that seems to perform better than the original TensorFlow-based model (which, it turns out, was heavily overengineered; my bad!).

@juhoinkinen @mjsuhonos @mfakaehler Please try this out if you have a chance! If there are no big problems then I think this could soon be merged.

@juhoinkinen
Member

Using the code before today's commits, I ran the same benchmarks as above; full output here.

(Early stopping is a very nice feature! A decrease in nDCG stopped training after epoch 23 (with -j1) and epoch 22 (with -j6).)

train

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
| --- | --- | --- | --- | --- |
| user time (seconds) | 2810.63 | 2976.53 | 2948.25 | 2992.71 |
| percent CPU | 106% | 108% | 571% | 573% |
| wall time | 44:26.96 | 46:39.13 | 8:45.21 | 8:56.57 |
| max RSS | 3_368_876 | 3_354_828 | 2_599_604 | 3_064_212 |
| model disk size (bytes) | 1_304_759_580 | 1_247_592_202 | (same as with -j1) | (same as with -j1) |

eval

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
| --- | --- | --- | --- | --- |
| user time | 475.29 | 508.43 | 485.92 | 506.08 |
| percent CPU | 99% | 99% | 498% | 505% |
| wall time | 7:58.65 | 8:32.86 | 1:38.66 | 1:41.22 |
| max RSS | 2_666_460 | 2_198_968 | 2_105_688 | 1_787_660 |
| nDCG | 0.4805 | 0.4659 | 0.4775 | 0.4688 |

Member

@juhoinkinen juhoinkinen left a comment


I read through the code and have no complaints. :)

There could be a short explanation of the model architecture somewhere, maybe on the NN ensemble wiki page.

We could try out whether online learning works better with the new implementation.

Member Author

osma commented Feb 12, 2026

@juhoinkinen Thanks for the new benchmark and the approval!

It's a pity that nDCG seems to have decreased. With the Finto AI (Finnish) training data set that I used for most tests, there was a nice increase of more than 0.02 in F1@5 scores.

You reported that the model size is almost unchanged, but this number seems to include the LMDB, which contains all the preprocessed training data. The actual model file (nn-model.pt) should be just a small fraction of the old model file (nn-model.keras); in my tests it went from 181MB to 618kB.

Member Author

osma commented Feb 13, 2026

I did some further testing on YSO- and KOKO-based projects. Sadly, the evaluation results have in most cases decreased compared to the old NN ensemble (by around 0.02 in F1@5 scores). I think the model architecture still needs some work; the current one may be too simple after all, even if it performs better than the old one in certain cases (e.g. JYU-theses/fi).

Also, the LMDB size now seems to grow faster than before. All the KOKO models required more than 2GB of disk space, whereas previously they didn't hit the default 1GB limit. This needs to be investigated; it was not intended.

@mfakaehler
Collaborator

Sandro and I have tested the new nn-ensemble backend. As a small disclaimer up front: even with the old implementation, we have not yet found a configuration that produces better results than the plain ensemble. Maybe the GND, with its >200K entities, is too hard a problem for the architecture.

We tested the following config with the new and the old branches and trained the ensemble on ~250K tables of contents.

projects.cfg
[gnd-nn-PyTorch]
name=gnd-nn-PyTorch
language=de
backend=nn_ensemble
sources=gnd-1299-ob-1.4.0-TOC-IHT-ger:0.5,gnd-1299-ob-1.4.0-TITLE-ger-2:0.32,gnd-1299-mllm-1.4.0-TOC-ger:0.06,gnd-1299-mllm-1.4.0-TITLE-ger:0.12
limit=100
vocab=gnd-202510-06b-reduced
nodes=100
dropout_rate=0.2
epochs=10
lmdb_map_size=7516192768

While the TensorFlow implementation ran through and produced reasonable results, the PyTorch version had serious trouble with the LMDB and aborted during document processing:

Backend nn_ensemble: Processing training documents...

resulting in the following error:

Traceback (most recent call last):
  File "/home/sandro/Annif_nn_dev/.venv/bin/annif", line 10, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/decorators.py", line 34, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/flask/cli.py", line 400, in decorator
    return ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/.venv/lib/python3.12/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/annif/cli.py", line 220, in run_train
    proj.train(documents, backend_params, jobs)
  File "/home/sandro/Annif_nn_dev/annif/project.py", line 286, in train
    self.backend.train(corpus, beparams, jobs)
  File "/home/sandro/Annif_nn_dev/annif/backend/backend.py", line 107, in train
    return self._train(corpus, params=beparams, jobs=jobs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandro/Annif_nn_dev/annif/backend/nn_ensemble.py", line 275, in _train
    self._fit_model(
  File "/home/sandro/Annif_nn_dev/annif/backend/nn_ensemble.py", line 344, in _fit_model
    self._corpus_to_vectors(corpus, seq, n_jobs)
  File "/home/sandro/Annif_nn_dev/annif/backend/nn_ensemble.py", line 319, in _corpus_to_vectors
    seq.add_sample(score_vector, true_vector)
  File "/home/sandro/Annif_nn_dev/annif/backend/nn_ensemble.py", line 72, in add_sample
    self._txn.put(key, buf.read())
lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached

We restarted the process after increasing the LMDB size to 70GB and then to 400GB, but both runs eventually aborted. Only after restricting the training documents to 25K (1/10 of the original) did the actual PyTorch training start and the process finish. The final size of the file nn-train.mdb was 43GB, while the old branch produced a file of 1GB (while having processed 10x more documents).

We will report any other observations later.


san-uh commented Feb 23, 2026

Here are further results from a test that @mfakaehler and I carried out. Since the training data set of 260K documents could not be used (see above), we limited the training data to 25K documents.
We used the classic NN ensemble based on Keras/TensorFlow with Annif 1.4.1 and, for the new PyTorch-based NN ensemble, the available dev branch.

Settings

Test case settings
test case: German-language documents; text kind: tables of contents (tocs)
vocabulary: 456,599 GND descriptors
train data: 25,000 tocs
test data: 40,885 tocs

Single models
omikuji trained with 604,775 tocs and 251,818 blurbs
omikuji trained with 964,507 titles
mllm 1 trained with 5,000 tocs
mllm 2 trained with 5,000 titles

nn-ensemble parameters

nn classic (Keras/TensorFlow)
[25k-gnd-nn-Keras-TF]
name=25k-gnd-nn-Keras-TF
language=de
backend=nn_ensemble
sources=gnd-1299-ob-1.4.0-TOC-IHT-ger:0.5,gnd-1299-ob-1.4.0-TITLE-ger-2:0.32,gnd-1299-mllm-1.4.0-TOC-ger:0.06,gnd-1299-mllm-1.4.0-TITLE-ger:0.12
limit=100
vocab=gnd-202510-06b-reduced
nodes=100
dropout_rate=0.2
epochs=10
lmdb_map_size=7516192768
nn new PyTorch
[gnd-nn-PyTorch]
name=gnd-nn-PyTorch
language=de
backend=nn_ensemble
sources=gnd-1299-ob-1.4.0-TOC-IHT-ger:0.5,gnd-1299-ob-1.4.0-TITLE-ger-2:0.32,gnd-1299-mllm-1.4.0-TOC-ger:0.06,gnd-1299-mllm-1.4.0-TITLE-ger:0.12
limit=100
vocab=gnd-202510-06b-reduced
nodes=100
dropout_rate=0.2
epochs=10
lmdb_map_size=429496729600

Technical settings
1008 GB Memory
96 CPUs
2 GPUs (NVIDIA A100 80GB PCIe) available but were not used

Results

Below are some observations and analysis of the results. Information on memory and CPU usage can be found in the graphs and figures in Appendix 926-test-nn-cpu-memory.pdf

train (25,000 tocs)

| | nn classic 25k Keras/TensorFlow (main) -j80 | nn new 25k PyTorch (this PR) -j80 |
| --- | --- | --- |
| real time | 66m12.713s | 32m4.964s |
| model disk size | 1024 MB (nn-train.mdb), 2.6 GB (nn-model.keras) | 43 GB (nn-train.mdb), 8.8 MB (nn-model.pt) |

The mdb file sizes differ enormously: despite the same training volume, the new PyTorch nn-train.mdb grows many times faster than the classic one. The model files show quite the opposite behaviour.

Note: the new nn PyTorch ensemble stopped after the 6th epoch due to the early stopping functionality. Message:
Backend nn_ensemble: Epoch 6/50: NDCG=0.9874
Backend nn_ensemble: Model no longer improving, using best epoch 3.
If all epochs had been performed, the actual training time of nn PyTorch would have been many times higher.

A suggestion for the early stopping feature: depending on the size of the vocabulary and, in particular, of the training data set, it might be useful to have an option to specify the size of the random subset used (currently n=512 documents). In the DNB test case, with a training data set of 25,000 documents, a higher number (e.g. 20% of the total) could be useful and lead to more robust conclusions. So you could either set EVAL_BATCH_SIZE proportional to the training data size, or make it configurable via the annif train call.
We ran our script with the same parameters as for the nn classic ensemble. Inspecting the code, however, we are not sure whether the parameters nodes, dropout_rate and epochs even exist in the new implementation.
Perhaps an option to disable early stopping, forcing training to continue to the end (i.e. across all epochs), would also be a good addition.

eval (40,885 tocs)

| | nn classic 25k Keras/TensorFlow (main) -j80 | nn new 25k PyTorch (this PR) -j80 |
| --- | --- | --- |
| real time | 12m53.328s | 10m52.225s |
| F1@5 (doc avg) | 0.3941 | 0.3954 |
| F1@10 (doc avg) | 0.2880 | 0.2843 |
| NDCG@10 | 0.6383 | 0.6476 |

The new nn PyTorch is 2 minutes faster in evaluation than nn classic.

There is hardly any difference in quality: nn PyTorch and nn classic achieve (almost) identical values for F1@5. For F1@10, nn classic is slightly ahead, while the new nn PyTorch ranks the top 10 suggestions slightly better (see NDCG@10).
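For reference, the nDCG metric reported here (and used for early stopping in this PR) can be sketched as follows; this helper is hypothetical and simplified, not Annif's actual implementation:

```python
import numpy as np


def ndcg_at_k(rel_in_score_order, n_relevant, k=10):
    """Hypothetical helper: nDCG@k for a single document.
    `rel_in_score_order` holds binary gold relevance of the suggestions,
    ordered by descending predicted score; `n_relevant` is the total
    number of gold labels for the document."""
    rel = np.asarray(rel_in_score_order, dtype=float)[:k]
    # positions 1..k are discounted by 1/log2(position + 1)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    # ideal DCG: all relevant items packed into the top ranks
    idcg = float(discounts[: min(n_relevant, k)].sum())
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k([1, 1, 0], n_relevant=2))  # 1.0 (perfect ranking)
print(ndcg_at_k([0, 1, 1], n_relevant=2))  # < 1: relevant items ranked lower
```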

index (40,885 tocs)

| | nn classic 25k Keras/TensorFlow (main) -j80 | nn new 25k PyTorch (this PR) -j80 |
| --- | --- | --- |
| real time | 66m12.713s | 106m32.691s |

nn classic requires 0.097 seconds per document; the new nn PyTorch requires 0.156 seconds per document.

Member Author

osma commented Feb 23, 2026

Thanks a lot for the detailed report @san-uh! It is clear that something is wrong with the LMDB database (nn-train.mdb), which is growing much faster than before. That obviously needs fixing.

After that, I will rework the model architecture to try to achieve better results on at least some of our data sets (e.g. KOKO-based ones) where the outcome was much worse than with the old NN ensemble.

Regarding the early stopping heuristic and the 512-document sample used to calculate the metric: you are right that this is a bit rigid. However, in my tests it didn't seem to matter much: choosing a non-ideal subset might cause early stopping to happen one epoch earlier or later than ideal, but the result should be close to optimal regardless. I can still try to make this more flexible.
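The patience-based early stopping discussed in this thread can be sketched as follows (a hypothetical helper, not the actual Annif code): train up to max_epochs, track nDCG on a fixed sample after each epoch, and stop once the score has not improved for `patience` epochs.

```python
def train_with_early_stopping(train_epoch, sample_ndcg,
                              max_epochs=20, patience=2):
    """Run `train_epoch(epoch)` up to max_epochs times, evaluating
    `sample_ndcg()` after each; stop when the score has not improved
    for `patience` consecutive epochs."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        score = sample_ndcg()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # scores declining: keep the best epoch seen so far
    return best_epoch, best_score


# Toy run with canned scores instead of real training:
scores = iter([0.50, 0.60, 0.55, 0.54, 0.53])
best_epoch, best_score = train_with_early_stopping(
    lambda epoch: None, lambda: next(scores))
print(best_epoch, best_score)  # 2 0.6
```

Making the sample size (or the patience) a parameter, as suggested above, would be a small change in such a scheme.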


Member Author

osma commented Feb 26, 2026

It took a while to hunt this down, but I found the cause of the massive increase in LMDB size: the code was using a suboptimal sparse matrix type (CSC instead of CSR). CSC used to be the right choice in the old code, but the new code flips some matrix dimensions around, so CSR is now needed. The most recent commit 211fde9 fixes this.

The next step is further iteration on the model itself; I think some alternative approaches must be evaluated, as the current one may be too simple.
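To illustrate why the sparse format matters so much for serialized size (a sketch with assumed dimensions, not the actual Annif code): if each stored sample is a 1 x n_concepts score vector, CSC carries a per-column index array while CSR carries a per-row one.

```python
import numpy as np
from scipy import sparse

# Assumed setup: one training sample is a single sparse row of concept
# scores, with very few nonzero entries.
n_concepts = 456_599  # e.g. the GND vocabulary size mentioned above
row = sparse.random(1, n_concepts, density=0.0002, format="csr",
                    dtype=np.float32, random_state=0)

csr = row.tocsr()
csc = row.tocsc()

# CSR stores an indptr array with n_rows + 1 = 2 entries, while CSC
# stores one with n_cols + 1 entries -- one per concept -- so serializing
# each sample as CSC blows up the per-sample size enormously.
print(len(csr.indptr))  # 2
print(len(csc.indptr))  # 456600
```

This matches the observed symptom: the same data, stored per-sample in the wrong orientation, produced a 43GB database where 1GB used to suffice.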
