Update dataloader to provide new output structure #101

karlhigley merged 16 commits into NVIDIA-Merlin:main
Conversation
@jperez999 @oliverholworthy I would be interested in early feedback :)
Based on the TF RaggedTensor docs, it seems like we could handle this with a multi-dimensional array of offsets.
Thanks for the documentation; although I posted it, I hadn't seen it. I was wondering how to deal with that case. I am not sure a multi-dimensional __offsets column will work. We need to be able to feed the same structure into Triton, and I think the idea is that we provide only 1-D tensors in every output column. In the TensorFlow documentation, the two offsets vectors have different lengths. I think we should provide multiple offsets columns with __offsets_0, __offsets_1, etc. What do you think?
Multiple arrays' worth of offsets sounds reasonable, but I think we should ignore the case with multiple ragged dimensions for now, since we don't currently have a use case for it.
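For reference, a 2-level ragged column could be flattened into a single values array plus one offsets array per ragged dimension. This is a minimal sketch of that idea, assuming the hypothetical `__offsets_0`/`__offsets_1` naming suggested above (not an actual Merlin API):

```python
import numpy as np

# Hypothetical flat encoding of a 2-level ragged column.
values = np.array([3, 1, 4, 1, 5, 9, 2, 6])
offsets_1 = np.array([0, 2, 3, 5, 8])  # inner dimension: 4 inner lists
offsets_0 = np.array([0, 1, 1, 4])     # outer dimension: 3 outer rows

def decode_nested(values, offsets_0, offsets_1):
    # First rebuild the inner lists from the innermost offsets ...
    inner = [values[offsets_1[i]:offsets_1[i + 1]].tolist()
             for i in range(len(offsets_1) - 1)]
    # ... then group those inner lists using the outer offsets.
    return [inner[offsets_0[i]:offsets_0[i + 1]]
            for i in range(len(offsets_0) - 1)]

rows = decode_nested(values, offsets_0, offsets_1)
# rows == [[[3, 1]], [], [[4], [1, 5], [9, 2, 6]]]
```

Note that the two offsets arrays have different lengths, which is why a single rectangular multi-dimensional offsets tensor would be awkward.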
```python
X[column_name] = self._to_sparse_tensor(X[column_name], column_name)
if self.lists_as_tuple:
    tensor = (X[column_name][0], X[column_name][1][:-1])
```
It seems like we want the full set of offsets here, not just the last one ([:-1])
Currently, the PyTorch dataloader does NOT provide the last offset (the length of the batch). See comment: #101 (comment)
The function _to_sparse_tensor (though I think the result is actually a dense tensor) does not work when the last offset (the length) is part of the offsets tensor.
Ohhhhh, I get it now 👍🏻
I noticed the issue was that the torch dataloader adds the last offset in the _to_sparse_tensor:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/torch.py#L148-L151
I will remove these lines in my updated PR
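To make the role of the trailing offset concrete, here is a minimal NumPy sketch of densifying a values/offsets pair. The helper name is illustrative, not the actual dataloader implementation; the point is that with the final boundary included, row extents come straight from consecutive offset pairs:

```python
import numpy as np

# Illustrative densification: offsets is assumed to include the final
# boundary, i.e. offsets[-1] == len(values).
def to_dense(values, offsets):
    row_lengths = np.diff(offsets)            # needs the last offset
    dense = np.zeros((len(row_lengths), row_lengths.max()), dtype=values.dtype)
    for i in range(len(row_lengths)):
        dense[i, : row_lengths[i]] = values[offsets[i]:offsets[i + 1]]
    return dense

values = np.array([1, 2, 3, 4, 5])
offsets = np.array([0, 2, 5])  # includes the final offset, len(values)
# to_dense(values, offsets) -> [[1, 2, 0], [3, 4, 5]]
```

If the dataloader itself already appends the last offset, a conversion like this must not append it again, which is why removing the duplicated lines makes sense.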
```python
device=None,
use_row_lengths=False,
tensors_as_1d=True,
lists_as_tuple=False,
```
Not sure how others feel about this, but in the interest of simplifying the code, I wonder if we can live without the extra configuration flags and just make the new behavior the default/standard.
Unless the default values preserve the existing behaviour, it's going to be a breaking change that forces some action. That said, I agree that not having the flags seems like a reasonable option in this case.
```python
def _reshape_dim(self, tensor):
    if self.tensors_as_1d:
        return tf.reshape(tensor, shape=[-1])
    else:
        return tf.reshape(tensor, shape=[-1, 1])
```

👍🏻
In the course of researching this change, I learned that reshape is essentially free because it just changes how the data in memory is interpreted, but transpose actually copies the data, so it's more expensive. In theory, either one would work here, but reshape is definitely the better choice.
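The "reshape is free" point can be checked in NumPy terms: reshaping a contiguous array returns a view over the same buffer rather than copying. This is only an analogy for the TF behavior, but it illustrates the cost model:

```python
import numpy as np

# Reshape returns views over the same underlying buffer (no copy).
a = np.arange(6).reshape(2, 3)
flat = a.reshape(-1)     # 1-D view, analogous to tf.reshape(t, [-1])
col = a.reshape(-1, 1)   # column view, analogous to tf.reshape(t, [-1, 1])

assert np.shares_memory(a, flat)
assert np.shares_memory(a, col)
```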
```python
def _add_last_offset(self, index, value):
```
I'm kind of surprised we're not already getting the length as one of the offsets 🤔
I think the issue is how we create batches from a chunk of dataframe:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L411
If we have an offsets tensor [0, 3, 5, 9, 10, 15, 20, 25, 27, 30] and split it with batch_size 3, we get [0, 3, 5] and [9, 10, 15, 20, 25, 27, 30]: the last offset of each batch goes as the initial offset into the next batch. The second iteration yields [9, 10, 15] and [20, 25, 27, 30].
The offsets will be adjusted in this line: https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L438
I thought about changing it, but that would result in the new batch missing its initial offset:
[0, 3, 5, 9], [10, 15, 20, 25, 27, 30].
Note: currently, the PyTorch dataloader does NOT return the last offset values of a batch.
I think we need to duplicate the boundary offsets (9 and 20 in the example). I am not sure if there is a cleaner way to achieve it. Without trying to refactor the full logic (https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L388-L440), I felt that was the best approach.
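The boundary-duplication idea can be sketched in a few lines. This is illustrative only, not the dataloader's actual splitting code: each batch keeps the closing offset of the previous batch as its opening offset, then each batch's offsets are shifted so it starts at 0.

```python
# Illustrative sketch of splitting an offsets array into per-batch offsets,
# duplicating the boundary offsets (9 and 20 in the example below).
def split_offsets(offsets, batch_size):
    n_rows = len(offsets) - 1
    batches = []
    for start in range(0, n_rows, batch_size):
        end = min(start + batch_size, n_rows)
        chunk = offsets[start:end + 1]           # duplicates the boundary offset
        batches.append([o - chunk[0] for o in chunk])  # re-zero each batch
    return batches

offsets = [0, 3, 5, 9, 10, 15, 20, 25, 27, 30]
# split_offsets(offsets, 3) ->
#   [[0, 3, 5, 9], [0, 1, 6, 11], [0, 5, 7, 10]]
```

Each resulting batch carries the full set of offsets, including the trailing one, so downstream consumers can recover every row length.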
Yeah, that makes sense. Thanks for the explanation! I can't think of a cleaner way to do it yet either.
> I am not sure if there is a cleaner way to achieve it. Without trying to refactor the full logic
I'm working on refactoring the full logic you referenced here in #104, which may remove the requirement for this method.
That PR (#104) is currently incomplete since it returns only row_lengths (a final row_lengths -> offsets conversion still needs to be added). The row_lengths representation turned out to be easier to work with for splitting into batches, since the batch split array corresponds to the row lengths rather than the offsets.
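The row_lengths <-> offsets round trip mentioned here is a one-liner in each direction, which is a sketch of why splitting on row_lengths is simpler: each entry belongs to exactly one row, so a batch split is a plain slice, with no boundary handling needed.

```python
import numpy as np

# Converting between the two ragged representations.
def offsets_to_row_lengths(offsets):
    return np.diff(offsets)

def row_lengths_to_offsets(row_lengths):
    return np.concatenate([[0], np.cumsum(row_lengths)])

row_lengths = offsets_to_row_lengths(np.array([0, 4, 7, 9]))
# row_lengths -> [4, 3, 2]; splitting it per batch is just row_lengths[a:b]
# row_lengths_to_offsets(row_lengths) -> [0, 4, 7, 9]
```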
```python
if self.tensors_as_1d:
    return tensor.view(-1)
else:
    return tensor.view(-1, 1)
```

view() seems good here 👍🏻

I tried squeeze, but it was incorrect for a single batch: `{"cat": [1]}` got squeezed to `{"cat": 1}`.
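The squeeze pitfall is easy to reproduce. Here is a NumPy illustration of the same behavior (torch's `squeeze`/`view(-1)` behave analogously): with a single element, squeeze collapses all the way to a 0-d scalar, while an explicit reshape keeps the expected 1-D shape.

```python
import numpy as np

single = np.array([[1]])                   # a batch with one scalar element
assert single.squeeze().shape == ()        # 0-d scalar: the {"cat": 1} problem
assert single.reshape(-1).shape == (1,)    # stays a length-1 vector
```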
```python
print(col)
print(feature_tensor)
if tensors_as_1d or "__values" in col or "__offsets" in col or col in spa_lst:
    assert len(list(feature_tensor.shape)) == 1
```
Might be good to add test assertions about the offsets either in this test or in a new test, just to double-check that the resulting offsets are what we expect
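As a sketch of what such assertions could check (the helper name and structure here are illustrative, not the actual test code), valid offsets start at zero, never decrease, and end at the total number of values:

```python
import numpy as np

# Hypothetical invariant checks for a values/offsets pair.
def check_offsets(values, offsets):
    assert offsets[0] == 0                   # starts at zero
    assert (np.diff(offsets) >= 0).all()     # monotonically non-decreasing
    assert offsets[-1] == len(values)        # final offset is the batch length

check_offsets(np.array([1, 2, 3, 4, 5, 6, 7]), np.array([0, 4, 7]))
```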
I had to modify some unit tests, which I think were incorrect / did not work. Updates based on the discussion on Slack + meeting:
bschifferer left a comment:
I reviewed @oliverholworthy's changes. I believe he modified the logic so that certain functions are no longer required (adding the last offset, reshaping the tensors). Since the unit tests pass, I think the code is working correctly.
Co-authored-by: Karl Higley <kmhigley@gmail.com>
Currently, the dataloaders and Merlin Models use values/row_lengths, which is incompatible with Triton's ragged batching support.

We want to update the output structure to:

{"col_name__values": [1, 2, …], "col_name__offsets": [0, 4, 7]}
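The target structure above can be sketched as a small conversion from a list column to flat values/offsets arrays. The helper name is illustrative, not part of the dataloader API:

```python
import numpy as np

# Illustrative conversion of a list column into the flat dict structure.
def to_values_offsets(col_name, list_column):
    values = np.concatenate([np.asarray(row) for row in list_column])
    offsets = np.concatenate([[0], np.cumsum([len(row) for row in list_column])])
    return {f"{col_name}__values": values, f"{col_name}__offsets": offsets}

out = to_values_offsets("col_name", [[1, 2, 3, 4], [5, 6, 7]])
# out["col_name__offsets"] -> [0, 4, 7], matching the example above
```

Both output columns are plain 1-D tensors, which is the property that makes the structure compatible with Triton's ragged batching.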