Transformers4Rec depends on the NVTabular class TorchAsyncItr for the dataloader definition. As this class was already updated to use the new data loader, T4Rec has been implicitly updated to use the new data loader which broke example notebooks, and CI tests.
I am listing here the issues raised in T4Rec because of the new conventions implemented in the merlin loader. The goal is to give enough context to decide what should be updated (in t4rec or loader) and ensure a stable integration with T4Rec:
- The new data loader is setting the input types from the source dataset being loaded (parquet files) while T4rec is following the convention of the old nvtabular where inputs are always converted to hard dtypes int32 and float32. (more information can be found in this ticket)
- As the data loader is not converting the features dtypes anymore, this conversion should happen in the dataset before saving to disk. As an example, the nvtabular criteo notebook shows how to convert hexadecimal features to numerical ones before saving to disk.
- The T4Rec model is expecting the data loader to load only the features specified in the model definition while the new t4rec loader is returning all features specified in the original dataset. (more details can be found in this PR description)
- T4rec is built on top of the nvtabular class
TorchAsyncItr which makes it sensitive to break each time nvtabular classes change. We could create the T4Rec loader class as a subclass of the merlin.loader.torch.Loader instead.
- The t4rec data loader class that used the nvtabular loader is registered as
nvtabular, after changing t4rec to use directly merlin data loader we could change the registered name to something like merlin-loader
- Some datasets can have continuous features with different types (float32+float64). T4Rec models are expecting all the continuous features to have the same type (all float32 or all float64 for example). In more detail, the issue is raised because t4rec is concatenating all continuous values into one vector and using an MLP project to get their final embeddings.
Transformers4Rec depends on the NVTabular class TorchAsyncItr for the dataloader definition. As this class was already updated to use the new data loader, T4Rec has been implicitly updated to use the new data loader which broke example notebooks, and CI tests.
I am listing here the issues raised in T4Rec because of the new conventions implemented in the merlin loader. The goal is to give enough context to decide what should be updated (in t4rec or loader) and ensure a stable integration with T4Rec:
TorchAsyncItrwhich makes it sensitive to break each time nvtabular classes change. We could create the T4Rec loader class as a subclass of themerlin.loader.torch.Loaderinstead.nvtabular, after changing t4rec to use directly merlin data loader we could change the registered name to something likemerlin-loader