Tensorflow 2.17 updates by lfoppiano · Pull Request #181 · kermitt2/delft

lfoppiano · 2025-05-28T20:38:24Z

@kermitt2 Training fails for me with TensorFlow 2.17 (note that the TensorFlow Docker image uses Python 3.11). It happens even with a simple model like date:

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1748463701.859042 1108825 service.cc:146] XLA service 0x7f12547d4870 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1748463701.859079 1108825 service.cc:154]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2025-05-28 22:21:41.865569: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2025-05-28 22:21:41.883818: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8906
I0000 00:00:1748463701.939038 1108825 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
10/10 [==============================] - ETA: 0s - loss: 5.3103 - crf_loss: 5.3103      f1 (micro): 85.80
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/applications/grobidTagger.py", line 475, in <module>
    train(model, 
    ^^^^^^^^^^^^
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/applications/grobidTagger.py", line 222, in train
    model.train(x_train, y_train, f_train, x_valid, y_valid, f_valid, incremental=incremental, multi_gpu=multi_gpu)
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/sequenceLabelling/wrapper.py", line 165, in train
    self.train_(x_train, y_train, f_train, x_valid, y_valid, f_valid, incremental, callbacks)
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/sequenceLabelling/wrapper.py", line 216, in train_
    trainer.train(x_train, y_train, x_valid, y_valid, features_train=f_train, features_valid=f_valid, callbacks=callbacks)
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/sequenceLabelling/trainer.py", line 59, in train
    self.model = self.train_model(self.model, x_train, y_train, x_valid=x_valid, y_valid=y_valid,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/sequenceLabelling/trainer.py", line 175, in train_model
    local_model.fit(training_generator,
  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/netscratch/lfoppiano/delft/delft_tf2.17.1/delft/sequenceLabelling/trainer.py", line 279, in on_epoch_end
    logs.update({"lr": self.model.optimizer._decayed_lr(tf.float32)})
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Adam' object has no attribute '_decayed_lr'
2025-05-28 22:21:54.133744: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]

The fix in this branch solved the issue for me (the library updates are only minor).
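For readers hitting the same `AttributeError`, here is a minimal sketch of a version-tolerant way to read the learning rate in the callback. It is not necessarily the change made in this branch. The helper name `current_learning_rate` is hypothetical, and the sketch assumes the optimizer exposes either the private `_decayed_lr` helper (older optimizers) or a public `learning_rate` / `lr` attribute, which may hold a schedule callable on the step.

```python
def current_learning_rate(optimizer, dtype=None):
    """Best-effort read of an optimizer's current learning rate.

    Hypothetical helper, duck-typed so it never touches a missing attribute.
    """
    # Older Keras optimizers expose the private helper the callback relied on.
    legacy = getattr(optimizer, "_decayed_lr", None)
    if callable(legacy):
        return legacy(dtype)

    # Otherwise fall back to the public attribute (`lr` is an older alias).
    lr = getattr(optimizer, "learning_rate", None)
    if lr is None:
        lr = getattr(optimizer, "lr", None)
    if lr is None:
        raise AttributeError("optimizer exposes no learning rate attribute")

    # Assumption: a callable value is a schedule that maps a step to a rate.
    if callable(lr):
        lr = lr(optimizer.iterations)
    return lr
```

In `on_epoch_end` this would be called as `current_learning_rate(self.model.optimizer, tf.float32)`; the `dtype` argument only matters on the legacy path.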

lfoppiano · 2025-05-28T20:39:16Z

Separately, I occasionally get the warning below and cannot figure out where it comes from. It is not fatal, but it might indicate a bug:

2025-05-28 22:21:35.126118: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_FLOAT
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_BOOL
    }
  }
}

        for Tuple type infernce function 0
        while inferring type of node 'cond/output/_757'

de-code · 2025-12-19T19:49:47Z

> As a separate comment, I also have this occasionally, but I cannot figure out where they come from, this is not fatal, but might indicate some bugs:

I came across that too and, with the help of AI, found that enabling eager mode helped:
eLifePathways/sciencebeam-trainer-delft@87a40f2
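A minimal sketch of what "enabling eager mode" can look like, assuming the global switch `tf.config.run_functions_eagerly` is used; the linked commit may do it differently (for example via `run_eagerly=True` in `model.compile`). The function name `enable_eager_for_debugging` is hypothetical, and the guarded import is only there so the snippet also loads on machines without TensorFlow.

```python
try:
    import tensorflow as tf
except ImportError:  # keeps the sketch importable where TensorFlow is absent
    tf = None


def enable_eager_for_debugging():
    """Ask TensorFlow to run tf.function bodies eagerly; report whether it is active."""
    if tf is None:
        return False
    # Global switch: tf.function-decorated code runs op by op instead of as a graph.
    tf.config.run_functions_eagerly(True)
    return tf.config.functions_run_eagerly()
```

This would be called once before building and training the model. Eager execution is typically slower than graph execution, so it is better suited to diagnosing the warning than to long training runs.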

lfoppiano · 2026-02-21T19:44:05Z

@de-code It works. Thanks 😄

Question: why did you remove the different allocation of loss function?

EDIT: actually, I did not manage to get past the second epoch... 😭

lfoppiano added 2 commits May 28, 2025 22:33

update libraries

c4ea1ed

Fix crash when accessing learning rate

adaf99d

lfoppiano mentioned this pull request Aug 26, 2025

Update to JDK21 and Gradle 9 grobidOrg/grobid#1321

Merged

fix: align versions

c8dd029

lfoppiano force-pushed the tensorflow-2.17-updates branch 3 times, most recently from 5148b00 to a212306 on February 21, 2026 20:30

fix: conflict between cuda and torch

2b3aa3d

lfoppiano force-pushed the tensorflow-2.17-updates branch from 86686e2 to 2b3aa3d on February 22, 2026 07:35

lfoppiano added 18 commits February 22, 2026 08:37

fix: update macos requirements

35b93dc

fix: disable multiprocessing when training with mac

2edb191

fix: prevent error when creating existing embedding LMDB directory by allowing existing paths.

f270733

feat: store embeddings in float32 instead of pickle

b29108a

fix: make embeddings pickable

9b9a552

fix: remove disable of multiprocessing for mac os

df5de56

feat: reopen embeddings in multithread to enable concurrency in LMDB access

3fabecb

feat: add num_workers parameter for configurable data loading

e26239b

feat: add slurm scripts

73936b5

Signed-off-by: Luca Foppiano <luca@foppiano.org>

fix: adjust paths

c45d121

feat: add distributed scripts for SLURM

b060464

Signed-off-by: Luca Foppiano <luca@foppiano.org>

feat: train incrementally after the first round

1fcafc8

feat: add num_workers support for header model in training and evaluation scripts

f0a268f

chore: update the data, use a sortable date in the name

a68e433

chore: remove segmentation model

ebe6836

feat: get the latest training data when no input file is specified

630e902

feat: include citation model in the one with num-workers = 6

a9146fc

Re-trained models with the new tensorflow version

3037c65

Signed-off-by: Luca Foppiano <luca@foppiano.org>

lfoppiano and others added 14 commits February 25, 2026 07:45

fix: refine scripts

98b508e

fix: update scripts

f34fc27

chore: update imports to use the right keras version

a1db014

chore: cleanup scripts

dfbb116

fix: move from srun to sbatch

16be6f3

fix: add fake import to get the correct CUDA

66952db

chore: add proper communication

d1e97e2

fix dependencies

32ca206

fix: license and copyright distributed script

3674e43

feat: retrain models

2467d92

Signed-off-by: Luca Foppiano <luca@foppiano.org>

update script

e49d0ee

Signed-off-by: Luca Foppiano <luca@foppiano.org>

feat: add script for retrain the citation models

87a69dc

Merge pull request #191 from kermitt2/tensorflow-2.17-training

e223d1d

Fix Tensorflow 2.17 training

chore: update documentation

3484156

lfoppiano merged commit eb57a80 into tensorflow-2.17 Feb 28, 2026
2 checks passed

lfoppiano deleted the tensorflow-2.17-updates branch February 28, 2026 19:23

lfoppiano merged 36 commits into tensorflow-2.17 from tensorflow-2.17-updates
