Update dataloader to provide new output structure #101

karlhigley merged 16 commits into NVIDIA-Merlin:main
Conversation
@jperez999 @oliverholworthy I would be interested in early feedback :)
Based on the TF RaggedTensor docs, it seems like we could handle this with a multi-dimensional array of offsets.
Thanks for the documentation; although I posted it, I hadn't seen it. I was wondering how to deal with that case. I am not sure a multi-dimensional __offsets column will work. We need to be able to feed the same structure into Triton, and I think the idea is that we provide only 1-D tensors in every output column. In the TensorFlow documentation, the two offsets vectors have different lengths. I think we should provide multiple offsets columns with __offsets_0, __offsets_1, etc. What do you think?
Multiple arrays' worth of offsets sounds reasonable, but I think we should ignore the case with multiple ragged dimensions for now, since we don't currently have a use case for it.
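For reference, a 2-level ragged column could be flattened into a single values array plus one offsets array per ragged dimension. This is a minimal sketch of that idea, assuming the hypothetical `__offsets_0`/`__offsets_1` naming suggested above (not an actual Merlin API):

```python
import numpy as np

# Hypothetical flat encoding of a 2-level ragged column.
values = np.array([3, 1, 4, 1, 5, 9, 2, 6])
offsets_1 = np.array([0, 2, 3, 5, 8])  # inner dimension: 4 inner lists
offsets_0 = np.array([0, 1, 1, 4])     # outer dimension: 3 outer rows

def decode_nested(values, offsets_0, offsets_1):
    # First rebuild the inner lists from the innermost offsets ...
    inner = [values[offsets_1[i]:offsets_1[i + 1]].tolist()
             for i in range(len(offsets_1) - 1)]
    # ... then group those inner lists using the outer offsets.
    return [inner[offsets_0[i]:offsets_0[i + 1]]
            for i in range(len(offsets_0) - 1)]

rows = decode_nested(values, offsets_0, offsets_1)
# rows == [[[3, 1]], [], [[4], [1, 5], [9, 2, 6]]]
```

Note that the two offsets arrays have different lengths, which is why a single rectangular multi-dimensional offsets tensor would be awkward.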
```python
X[column_name] = self._to_sparse_tensor(X[column_name], column_name)
if self.lists_as_tuple:
    tensor = (X[column_name][0], X[column_name][1][:-1])
```
It seems like we want the full set of offsets here, not just the last one ([:-1])
Currently, the PyTorch dataloader does NOT provide the last offset (the length of the batch). See comment: #101 (comment)
The function _to_sparse_tensor (though I think the result is actually a dense tensor) does not work when the last offset (the length) is part of the offsets tensor.
Ohhhhh, I get it now 👍🏻
I noticed the issue was that the torch dataloader adds the last offset in the _to_sparse_tensor:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/torch.py#L148-L151
I will remove these lines in my updated PR
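To make the role of the trailing offset concrete, here is a minimal NumPy sketch of densifying a values/offsets pair. The helper name is illustrative, not the actual dataloader implementation; the point is that with the final boundary included, row extents come straight from consecutive offset pairs:

```python
import numpy as np

# Illustrative densification: offsets is assumed to include the final
# boundary, i.e. offsets[-1] == len(values).
def to_dense(values, offsets):
    row_lengths = np.diff(offsets)            # needs the last offset
    dense = np.zeros((len(row_lengths), row_lengths.max()), dtype=values.dtype)
    for i in range(len(row_lengths)):
        dense[i, : row_lengths[i]] = values[offsets[i]:offsets[i + 1]]
    return dense

values = np.array([1, 2, 3, 4, 5])
offsets = np.array([0, 2, 5])  # includes the final offset, len(values)
# to_dense(values, offsets) -> [[1, 2, 0], [3, 4, 5]]
```

If the dataloader itself already appends the last offset, a conversion like this must not append it again, which is why removing the duplicated lines makes sense.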
```python
device=None,
use_row_lengths=False,
tensors_as_1d=True,
lists_as_tuple=False,
```
Not sure how others feel about this, but in the interest of simplifying the code, I wonder if we can live without the extra configuration flags and just make the new behavior the default/standard.
Unless the default values preserve the existing behaviour, it's going to be a breaking change that forces some action. That said, I agree that not having the flags seems like a reasonable option in this case.
```python
def _reshape_dim(self, tensor):
    if self.tensors_as_1d:
        return tf.reshape(tensor, shape=[-1])
    else:
        return tf.reshape(tensor, shape=[-1, 1])
```

👍🏻
In the course of researching this change, I learned that reshape is essentially free because it just changes how the data in memory is interpreted, but transpose actually copies the data, so it's more expensive. In theory, either one would work here, but reshape is definitely the better choice.
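The "reshape is free" point can be checked in NumPy terms: reshaping a contiguous array returns a view over the same buffer rather than copying. This is only an analogy for the TF behavior, but it illustrates the cost model:

```python
import numpy as np

# Reshape returns views over the same underlying buffer (no copy).
a = np.arange(6).reshape(2, 3)
flat = a.reshape(-1)     # 1-D view, analogous to tf.reshape(t, [-1])
col = a.reshape(-1, 1)   # column view, analogous to tf.reshape(t, [-1, 1])

assert np.shares_memory(a, flat)
assert np.shares_memory(a, col)
```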
```python
def _add_last_offset(self, index, value):
```
I'm kind of surprised we're not already getting the length as one of the offsets 🤔
I think the issue is how we create batches from a chunk of dataframe:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L411
If we have an offsets tensor [0, 3, 5, 9, 10, 15, 20, 25, 27, 30] and split it with batch_size 3, we get [0, 3, 5] and [9, 10, 15, 20, 25, 27, 30]: the last offset of each batch goes as the initial offset into the next batch. The second iteration yields [9, 10, 15] and [20, 25, 27, 30].
The offsets will be adjusted in this line: https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L438
I thought about changing it, but that would result in the new batch missing its initial offset:
[0, 3, 5, 9], [10, 15, 20, 25, 27, 30].
Note: currently, the PyTorch dataloader does NOT return the last offset values of a batch.
I think we need to duplicate the boundary offsets (9 and 20 in the example). I am not sure if there is a cleaner way to achieve it. Without trying to refactor the full logic (https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L388-L440), I felt that was the best approach.
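The boundary-duplication idea can be sketched in a few lines. This is illustrative only, not the dataloader's actual splitting code: each batch keeps the closing offset of the previous batch as its opening offset, then each batch's offsets are shifted so it starts at 0.

```python
# Illustrative sketch of splitting an offsets array into per-batch offsets,
# duplicating the boundary offsets (9 and 20 in the example below).
def split_offsets(offsets, batch_size):
    n_rows = len(offsets) - 1
    batches = []
    for start in range(0, n_rows, batch_size):
        end = min(start + batch_size, n_rows)
        chunk = offsets[start:end + 1]           # duplicates the boundary offset
        batches.append([o - chunk[0] for o in chunk])  # re-zero each batch
    return batches

offsets = [0, 3, 5, 9, 10, 15, 20, 25, 27, 30]
# split_offsets(offsets, 3) ->
#   [[0, 3, 5, 9], [0, 1, 6, 11], [0, 5, 7, 10]]
```

Each resulting batch carries the full set of offsets, including the trailing one, so downstream consumers can recover every row length.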
Yeah, that makes sense. Thanks for the explanation! I can't think of a cleaner way to do it yet either.
> I am not sure if there is a cleaner way to achieve it. Without trying to refactor the full logic
I'm working on refactoring the full logic you referenced here in #104, which may remove the requirement for this method.
That PR (#104) is currently incomplete since it returns only row_lengths (a final row_lengths -> offsets conversion still needs to be added). The row_lengths representation turned out to be easier to work with for splitting into batches, since the batch split array corresponds to the row lengths rather than the offsets.
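The row_lengths <-> offsets round trip mentioned here is a one-liner in each direction, which is a sketch of why splitting on row_lengths is simpler: each entry belongs to exactly one row, so a batch split is a plain slice, with no boundary handling needed.

```python
import numpy as np

# Converting between the two ragged representations.
def offsets_to_row_lengths(offsets):
    return np.diff(offsets)

def row_lengths_to_offsets(row_lengths):
    return np.concatenate([[0], np.cumsum(row_lengths)])

row_lengths = offsets_to_row_lengths(np.array([0, 4, 7, 9]))
# row_lengths -> [4, 3, 2]; splitting it per batch is just row_lengths[a:b]
# row_lengths_to_offsets(row_lengths) -> [0, 4, 7, 9]
```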
```python
if self.tensors_as_1d:
    return tensor.view(-1)
else:
    return tensor.view(-1, 1)
```

view() seems good here 👍🏻

I tried squeeze, but it was incorrect for a single batch: `{"cat": [1]}` got squeezed to `{"cat": 1}`.
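The squeeze pitfall is easy to reproduce. Here is a NumPy illustration of the same behavior (torch's `squeeze`/`view(-1)` behave analogously): with a single element, squeeze collapses all the way to a 0-d scalar, while an explicit reshape keeps the expected 1-D shape.

```python
import numpy as np

single = np.array([[1]])                   # a batch with one scalar element
assert single.squeeze().shape == ()        # 0-d scalar: the {"cat": 1} problem
assert single.reshape(-1).shape == (1,)    # stays a length-1 vector
```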
```python
print(col)
print(feature_tensor)
if tensors_as_1d or "__values" in col or "__offsets" in col or col in spa_lst:
    assert len(list(feature_tensor.shape)) == 1
```
Might be good to add test assertions about the offsets either in this test or in a new test, just to double-check that the resulting offsets are what we expect
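As a sketch of what such assertions could check (the helper name and structure here are illustrative, not the actual test code), valid offsets start at zero, never decrease, and end at the total number of values:

```python
import numpy as np

# Hypothetical invariant checks for a values/offsets pair.
def check_offsets(values, offsets):
    assert offsets[0] == 0                   # starts at zero
    assert (np.diff(offsets) >= 0).all()     # monotonically non-decreasing
    assert offsets[-1] == len(values)        # final offset is the batch length

check_offsets(np.array([1, 2, 3, 4, 5, 6, 7]), np.array([0, 4, 7]))
```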
I had to modify some unit tests, which I think were incorrect / did not work. Updates based on the discussion on Slack + meeting:
bschifferer left a comment:
I reviewed @oliverholworthy's changes. I believe he modified the logic so that certain functions are no longer required (adding the last offset, reshaping the tensors). Since the unit tests pass, I think the code is working correctly.
Co-authored-by: Karl Higley <kmhigley@gmail.com>
Currently, the dataloaders and Merlin Models use values/row_lengths, which is incompatible with Triton's ragged batching support.

We want to update the output structure to:

{"col_name__values": [1, 2, …], "col_name__offsets": [0, 4, 7]}
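The target structure above can be sketched as a small conversion from a list column to flat values/offsets arrays. The helper name is illustrative, not part of the dataloader API:

```python
import numpy as np

# Illustrative conversion of a list column into the flat dict structure.
def to_values_offsets(col_name, list_column):
    values = np.concatenate([np.asarray(row) for row in list_column])
    offsets = np.concatenate([[0], np.cumsum([len(row) for row in list_column])])
    return {f"{col_name}__values": values, f"{col_name}__offsets": offsets}

out = to_values_offsets("col_name", [[1, 2, 3, 4], [5, 6, 7]])
# out["col_name__offsets"] -> [0, 4, 7], matching the example above
```

Both output columns are plain 1-D tensors, which is the property that makes the structure compatible with Triton's ragged batching.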