Skip to content

Update padding of ragged features to enable dataloader change #647

Merged
karlhigley merged 13 commits intoNVIDIA-Merlin:mainfrom
oliverholworthy:dataloader-dense-padding
Mar 15, 2023
Merged

Update padding of ragged features to enable dataloader change #647
karlhigley merged 13 commits intoNVIDIA-Merlin:mainfrom
oliverholworthy:dataloader-dense-padding

Conversation

@oliverholworthy
Copy link
Copy Markdown
Contributor

Goals ⚽

Implementation Details 🚧

  • Implements the equivalent of the padding currently in the dataloader
    • Uses ragged -> sparse -> dense conversion which appears to be faster than an alternative approach assigning values to a tensor constructed with zeros

Testing Details 🔍

  • Existing tests should cover existing usage of the Merlin Dataloader
  • Adds unit tests for padding function to check pad batch

@oliverholworthy oliverholworthy added the chore Maintenance for the repository label Mar 14, 2023
@oliverholworthy oliverholworthy self-assigned this Mar 14, 2023
@oliverholworthy oliverholworthy added this to the Merlin 23.03 milestone Mar 14, 2023
@github-actions
Copy link
Copy Markdown

conts=None,
labels=None,
):
schema = schema.select_by_name(conts + cats + labels)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove the =None for cats, conts & labels?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's clearer without the default now. it wouldn't have worked as None before this change either.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I've put the None back and added support for the None to the method by setting to an empty list if not provided.

It doesn't appear to be captured by any tests, but this would enable training a model with only continuous or only categorical features. Currently you need to have at least one of each for the current version of this MerlinDataloader to work.

Comment thread transformers4rec/torch/utils/padding.py Outdated

batch_padded = {}
for k, values in batch.items():
if k.endswith("__values"):
Copy link
Copy Markdown
Contributor

@marcromeyn marcromeyn Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to put __values and __offsets as constants somewhere in merlin-core? Or some util-method, like is_value or something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps TensorTable would be useful here since that seems to handle both the values/offsets dictionary and tuple variants for us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 43d535d

return pad_fn

@staticmethod
def _augment_schema(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this from schema_utiils to here so that it's closer to the only place it's called in the codebase.

@karlhigley karlhigley merged commit 8c13823 into NVIDIA-Merlin:main Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chore Maintenance for the repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants