Update Transformers4Rec to use new dataloader package

Transformers4Rec depends on the NVTabular class [TorchAsyncItr](https://github.com/NVIDIA-Merlin/NVTabular/blob/59579f2c46006fcb22795623ee9400c658166670/nvtabular/loader/torch.py#L22) for its dataloader definition. Because this class has already been updated to use the new data loader, T4Rec was implicitly switched to it as well, which broke the example notebooks and the CI tests.

Listed below are the issues raised in T4Rec by the new conventions implemented in the Merlin loader. The goal is to give enough context to decide what should be updated (in T4Rec or in the loader) and to ensure a stable integration with T4Rec:
 
1. The new data loader takes the input dtypes from the source dataset being loaded (parquet files), while T4Rec follows the convention of the old NVTabular loader, where inputs are always converted to the fixed dtypes int32 and float32. (More information can be found in this [ticket](https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/524).)
2. As the data loader no longer converts feature dtypes, this conversion should be applied to the dataset before it is saved to disk. As an example, the NVTabular Criteo [notebook](https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/scaling-criteo/01-Download-Convert.ipynb) shows how to convert hexadecimal features to numerical ones before saving to disk.
3. The T4Rec model expects the data loader to load only the features specified in the model definition, while the new data loader returns all features present in the original dataset. (More details can be found in this [PR](https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/536) description.)
4. T4Rec is built on top of the NVTabular class `TorchAsyncItr`, so it is liable to break each time the NVTabular classes change. We could define the T4Rec loader class as a subclass of `merlin.loader.torch.Loader` instead.
5. The T4Rec data loader class that uses the NVTabular loader is registered as `nvtabular`. Once T4Rec uses the Merlin data loader directly, we could change the registered name to something like `merlin-loader`.
6. Some datasets have continuous features with different dtypes (float32 and float64), whereas T4Rec models expect all continuous features to share the same dtype (for example, all float32 or all float64). The issue arises because T4Rec concatenates all continuous values into one vector and uses an MLP projection to get their final embeddings.
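
To make the dtype and feature-selection points above concrete, here is a minimal sketch of the preprocessing that could run before the dataset is saved to disk. It uses plain pandas rather than NVTabular or the Merlin loader, and the function name `prepare_for_t4rec` and the column names are hypothetical, chosen only for illustration. It assumes the old int32/float32 convention is the desired target.

```python
import numpy as np
import pandas as pd


def prepare_for_t4rec(df, model_features):
    """Keep only the model's features and cast them to int32/float32.

    Hypothetical helper: mirrors the old loader convention by hand,
    so the stored dataset already has the dtypes the model expects.
    """
    # Drop every column that is not part of the model definition.
    out = df[list(model_features)]

    # Build a per-column cast map: integers -> int32, floats -> float32.
    casts = {}
    for name, dtype in out.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            casts[name] = np.int32
        elif pd.api.types.is_float_dtype(dtype):
            casts[name] = np.float32
    return out.astype(casts)


# Toy data with mixed float32/float64 continuous features and an extra column.
raw = pd.DataFrame(
    {
        "item_id": np.array([1, 2, 3], dtype=np.int64),
        "price": np.array([0.5, 1.5, 2.5], dtype=np.float64),
        "age_days": np.array([1.0, 2.0, 3.0], dtype=np.float32),
        "unused": ["a", "b", "c"],
    }
)

clean = prepare_for_t4rec(raw, ["item_id", "price", "age_days"])
print(clean.dtypes)
```

After this step all continuous columns share one dtype, so they can be concatenated without mixing float32 and float64, and the extra column never reaches the loader. The result could then be written out with `clean.to_parquet(...)` before being handed to the data loader.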
