[DataPipe] key renamer by tmbdev · Pull Request #402 · meta-pytorch/data

tmbdev · 2022-05-13T19:51:44Z

This PR adds a filter that allows keys to be renamed in training samples represented as dictionaries. This is particularly useful for webdataset-style data sets, but can also be used with other dictionary iterators.
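
To make the idea concrete, here is a minimal sketch in plain Python, independent of the PR's actual implementation. The function name `rename_keys` and the use of `fnmatch`-style glob patterns are assumptions made for this example; only the keyword-argument style (`new_name=pattern`) is taken from the discussion in this thread.

```python
from fnmatch import fnmatch
from typing import Any, Dict, Iterable, Iterator


def rename_keys(samples: Iterable[Dict[str, Any]], **renamings: str) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield each sample with its keys renamed.

    Each keyword argument has the form new_name=pattern. In this
    simplified version, keys that match no pattern are dropped.
    """
    for sample in samples:
        renamed: Dict[str, Any] = {}
        for key, value in sample.items():
            for new_name, pattern in renamings.items():
                if fnmatch(key, pattern):  # assumption: glob-style matching
                    renamed[new_name] = value
                    break
        yield renamed


# webdataset-style samples: one dictionary per sample, keyed by file name
samples = [
    {"0001.jpg": b"image bytes", "0001.cls": b"3"},
    {"0002.jpg": b"image bytes", "0002.cls": b"7"},
]
renamed = list(rename_keys(samples, image="*.jpg", label="*.cls"))
```

After renaming, downstream stages can refer to `image` and `label` regardless of the original file names.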

NivekT

Overall, the implementation makes sense but it feels slightly specific. I will let others chime in on their thoughts. Aside from that, just a few nit comments:

It is worth mentioning that we generally name our DataPipes following this convention:
https://github.com/pytorch/data/blob/9ad73d85b398828a919aef692c27f77182716aa1/docs/source/tutorial.rst#naming

I should add that to the contributing guide.

torchdata/datapipes/iter/util/renamekeys.py

test/test_iterdatapipe.py

VitalyFedyunin · 2022-05-19T20:27:50Z

test/test_iterdatapipe.py

+        output = list(iter(stage2))
+        assert len(output) == 2
+        assert set(output[0].keys()) == set(["t", "b"])
+


Please test other boolean flags.

VitalyFedyunin · 2022-05-19T20:34:51Z

Please switch the order of inputs: 'pattern' -> 'new name' looks more natural.

tmbdev · 2022-05-20T00:26:36Z

The usual usage is with keyword arguments using a simple key as output and a pattern as input. It also parallels assignment. I think this order is more useful. What do you think?

VitalyFedyunin · 2022-05-20T01:22:16Z

In my opinion it makes sense to have two datapipes:
pattern_filter_keys -> takes patterns and throws away all mismatched keys (#406)
and
pattern_rename_keys -> takes a pattern->new_name dictionary and renames keys accordingly. In this case they will follow the same API pattern and would be easy to remember.
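
As a rough illustration of this two-stage proposal, the sketch below implements both steps as plain generator functions. Only the names `pattern_filter_keys` and `pattern_rename_keys` come from the comment above; the signatures, the glob-style matching via `fnmatch`, and the pass-through of unmatched keys in the rename step are assumptions, not code from this PR or from #406.

```python
from fnmatch import fnmatch
from typing import Any, Dict, Iterable, Iterator, List


def pattern_filter_keys(
    samples: Iterable[Dict[str, Any]], patterns: List[str]
) -> Iterator[Dict[str, Any]]:
    """Keep only the keys that match at least one pattern."""
    for sample in samples:
        yield {k: v for k, v in sample.items() if any(fnmatch(k, p) for p in patterns)}


def pattern_rename_keys(
    samples: Iterable[Dict[str, Any]], renamings: Dict[str, str]
) -> Iterator[Dict[str, Any]]:
    """Rename keys using a pattern -> new_name mapping.

    Assumption: keys that match no pattern pass through unchanged.
    """
    for sample in samples:
        out: Dict[str, Any] = {}
        for key, value in sample.items():
            # Use the first matching pattern's name, or keep the original key.
            new_name = next(
                (name for pattern, name in renamings.items() if fnmatch(key, pattern)),
                key,
            )
            out[new_name] = value
        yield out


samples = [{"a.jpg": 1, "a.cls": 2, "a.json": 3}]
filtered = list(pattern_filter_keys(samples, ["*.jpg", "*.cls"]))
renamed = list(pattern_rename_keys(filtered, {"*.jpg": "image", "*.cls": "label"}))
```

With this split, both functions take patterns as input, which is the consistency argument made above; the trade-off is that the common "select and rename" case needs two stages instead of one.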

VitalyFedyunin · 2022-05-20T01:25:19Z

torchdata/datapipes/iter/util/renamekeys.py

+
+    def __init__(
+        self,
+        source_datapipe: IterDataPipe[List[Union[Dict, List]]],


source_datapipe: IterDataPipe[Dict[Any, Any]]

This function isn't primarily intended as a general stream transformation, but as a quick, simple, and readable way of extracting fields for further processing in a data pipeline. That is, it is usually used for reading tar files with different file name patterns. The current API addresses that use case really well, so I would recommend keeping it as it is.

NivekT

Overall, it looks pretty good! Just a few comments. Thanks for the PR!

NivekT · 2022-09-13T21:08:19Z

torchdata/datapipes/iter/util/renamekeys.py

+        keep_unselected: keep keys/value pairs even if they don't match any pattern (False)
+        must_match: all key value pairs must match (True)
+        duplicate_is_error: it is an error if two renamings yield the same key (True)


nit: Should we move these after *args?

NivekT · 2022-09-13T21:11:40Z

torchdata/datapipes/iter/__init__.py

+    WebDatasetIterDataPipe as WebDataset,
+)
+from torchdata.datapipes.iter.util.renamekeys import (
+    KeyRenamerIterDataPipe as RenameKeys,


Suggested change

-    KeyRenamerIterDataPipe as RenameKeys,
+    KeyRenamerIterDataPipe as KeyRenamer,

NivekT · 2022-09-13T21:13:36Z

torchdata/datapipes/iter/util/renamekeys.py

+        *args,
+        keep_unselected=False,
+        must_match=True,
+        duplicate_is_error=True,


Suggested change

-        duplicate_is_error=True,
+        allow_duplicate=False,

nit: might be a better name but feel free to ignore

NivekT · 2022-09-13T21:21:28Z

torchdata/datapipes/iter/util/renamekeys.py

+        source_datapipe: a DataPipe yielding a stream of dictionaries.
+        keep_unselected: keep keys/value pairs even if they don't match any pattern (False)
+        must_match: all key value pairs must match (True)
+        duplicate_is_error: it is an error if two renamings yield the same key (True)


Suggested change

-        duplicate_is_error: it is an error if two renamings yield the same key (True)
+        duplicate_is_error: it is an error if two renamings yield the same key (True); otherwise the first matched one will be returned
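
The interaction of the three flags is easier to see in a small, self-contained sketch. The function `rename_sample` below is hypothetical and works on a single dictionary; it is not the PR's code. Two points are assumptions: glob-style matching via `fnmatch`, and reading `must_match` as "every pattern must match at least one key" (the docstring wording is ambiguous).

```python
from fnmatch import fnmatch
from typing import Any, Dict


def rename_sample(
    sample: Dict[str, Any],
    *,
    keep_unselected: bool = False,
    must_match: bool = True,
    duplicate_is_error: bool = True,
    **renamings: str,
) -> Dict[str, Any]:
    """Hypothetical single-sample renamer; keyword arguments are new_name=pattern."""
    out: Dict[str, Any] = {}
    matched = {pattern: False for pattern in renamings.values()}
    for key, value in sample.items():
        new_name = None
        for name, pattern in renamings.items():
            if fnmatch(key, pattern):
                matched[pattern] = True
                new_name = name
                break
        if new_name is None:
            if keep_unselected:
                out[key] = value  # pass unmatched keys through unchanged
            continue
        if new_name in out:
            if duplicate_is_error:
                raise ValueError(f"duplicate key after renaming: {new_name}")
            continue  # keep the first matched value, drop later ones
        out[new_name] = value
    # Assumed meaning of must_match: every pattern matched at least one key.
    if must_match and not all(matched.values()):
        raise ValueError("not all patterns matched a key in the sample")
    return out


sample = {"a.jpg": 1, "a.png": 2, "a.txt": 3}
default = rename_sample(sample, image="*.jpg")
kept = rename_sample(sample, keep_unselected=True, image="*.jpg")
first_wins = rename_sample(sample, duplicate_is_error=False, image="a.*")
```

With the defaults, the `image="a.*"` call would instead raise `ValueError`, because all three keys are renamed to `image`.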

Tom added 2 commits May 13, 2022 12:33

merged

df6bb89

added renamekeys

1598726

facebook-github-bot added the CLA Signed label May 13, 2022

NivekT reviewed May 19, 2022

NivekT changed the title ~~key renamer~~ [DataPipe] key renamer May 19, 2022

VitalyFedyunin reviewed May 19, 2022

VitalyFedyunin reviewed May 20, 2022

tmbdev and others added 3 commits August 31, 2022 17:06

renamed to KeyRenamerIterDatapipe

c250c51

resolved issues in renamekeys.py and improved tests

3035631

Merge branch 'main' into wdsrenamekeys

6153c85

NivekT reviewed Sep 13, 2022

4 participants