[DataPipe] extract keys by tmbdev · Pull Request #406 · meta-pytorch/data

tmbdev · 2022-05-13T20:05:29Z

This PR adds an ExtractKeys filter that turns samples represented as dictionaries into tuples. Tuples are constructed by selecting values from the dictionaries by matching the key against a given set of patterns.

torchdata/datapipes/iter/util/extractkeys.py

VitalyFedyunin

This DataPipe looks useful for other cases as well if we modify it a bit.

Could you please split it into two pipes?

The first one filters dict keys based on patterns:

        stage1 = IterableWrapper([
            {"1.txt": "1", "1.bin": "1b", "3.jpg":"foo"},
            {"2.txt": "2", "2.bin": "2b"},
        ])
        stage2 = ExtractKeys(stage1, "*.txt", "*.bin")
        output = list(iter(stage2))
        self.assertEqual({"1.txt": "1", "1.bin": "1b"}, output[0])

The second is a simple map that drops the keys:

dp = dp.map(lambda x: x.values())
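A minimal sketch of this two-stage split, written in plain Python with `fnmatch` rather than real DataPipes. The helper `filter_keys` is hypothetical and is not part of torchdata; it only illustrates the proposed behavior.

```python
from fnmatch import fnmatch


def filter_keys(sample, *patterns):
    """Keep only the entries whose key matches at least one glob pattern."""
    return {k: v for k, v in sample.items() if any(fnmatch(k, p) for p in patterns)}


samples = [
    {"1.txt": "1", "1.bin": "1b", "3.jpg": "foo"},
    {"2.txt": "2", "2.bin": "2b"},
]

# Stage 1: filter dict keys by pattern (each result is still a dict).
filtered = [filter_keys(s, "*.txt", "*.bin") for s in samples]

# Stage 2: a plain map that drops the keys and keeps only the values.
as_tuples = [tuple(d.values()) for d in filtered]
```

Keeping the filter and the map separate means the first stage can be reused wherever a dict output is wanted.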

test/test_iterdatapipe.py

NivekT

Hi @tmbdev,

Thanks for your commits on these PRs. Let us know if these are ready for review (but no rush at all!). @VitalyFedyunin and I will be happy to have a look.

Again, thanks for contributing to our library!

NivekT

A few comments. Feel free to not accept every requested change. Can you rebase as well?

Again, thank you so much for your contribution!

NivekT · 2022-09-13T20:18:49Z

torchdata/datapipes/iter/util/extractkeys.py

+
+
+@functional_datapipe("extract_keys")
+class ExtractKeysIterDataPipe(IterDataPipe[Dict]):


Can we rename this to KeyExtractor to follow our naming convention? Thanks.

We can still keep "extract_keys" as the functional name.

NivekT · 2022-09-13T20:23:09Z

test/test_iterdatapipe.py

        with self.assertRaises(TypeError):
            len(output_dp)

+    def test_extractor(self):


Suggested change

-    def test_extractor(self):
+    def test_key_extractor(self):

nit: We used to have a different extractor

NivekT · 2022-09-13T20:24:47Z

torchdata/datapipes/iter/util/extractkeys.py

+        duplicate_is_error: it is an error if the same key is selected twice (True)
+        ignore_missing: skip any dictionaries where one or more patterns don't match (False)


Suggested change

-        duplicate_is_error: it is an error if the same key is selected twice (True)
-        ignore_missing: skip any dictionaries where one or more patterns don't match (False)

Duplicate lines of descriptions

NivekT · 2022-09-13T20:25:31Z

torchdata/datapipes/iter/util/extractkeys.py

+        *args: list of glob patterns or list of glob patterns
+        duplicate_is_error: it is an error if the same key is selected twice (True)
+        ignore_missing: allow patterns not to match (i.e., incomplete outputs)
+        as_tuple: return a tuple instead of a dictionary


Suggested change

-        as_tuple: return a tuple instead of a dictionary
+        as_tuple: return a tuple instead of a dictionary (True or False here)

NivekT · 2022-09-13T20:27:07Z

torchdata/datapipes/iter/util/extractkeys.py

+    """
+
+    def __init__(
+        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False


Do we want to default as_tuple=False? Based on the docstring I would've guessed you wanted True instead.

Suggested change

-        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False
+        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error: bool = True, ignore_missing: bool = False, as_tuple: bool = False

nit: allow_duplicate might be a better name than duplicate_is_error

NivekT · 2022-09-13T20:28:35Z

torchdata/datapipes/iter/util/extractkeys.py

+    def __len__(self) -> int:
+        return len(self.source_datapipe)


Question: A sample will always be yielded even if nothing matches, right?
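The question matters because `__len__` simply forwards the length of the source. A toy sketch of the concern, in plain Python and not taken from the PR: `iter_matching` and its `skip_missing` flag are hypothetical stand-ins for a pipe that may or may not drop samples with no matching key.

```python
from fnmatch import fnmatch


def iter_matching(samples, pattern, skip_missing):
    """Yield the matching part of each sample, optionally skipping empty results."""
    for sample in samples:
        matched = {k: v for k, v in sample.items() if fnmatch(k, pattern)}
        if not matched and skip_missing:
            continue  # the sample is dropped, so the output becomes shorter
        yield matched


samples = [{"1.txt": "1"}, {"2.bin": "2b"}]

# If unmatched samples are skipped, len(source) overstates the output length.
kept = list(iter_matching(samples, "*.txt", skip_missing=True))

# If every sample is yielded, forwarding len(source) is accurate.
all_yielded = list(iter_matching(samples, "*.txt", skip_missing=False))
```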

NivekT · 2022-09-13T20:52:32Z

torchdata/datapipes/iter/util/extractkeys.py

+        duplicate_is_error: it is an error if the same key is selected twice (True)
+        ignore_missing: skip any dictionaries where one or more patterns don't match (False)
+        *args: list of glob patterns or list of glob patterns
+        duplicate_is_error: it is an error if the same key is selected twice (True)


Suggested change

-        duplicate_is_error: it is an error if the same key is selected twice (True)
+        duplicate_is_error: it is an error if the same key is selected twice (True), otherwise returns the first matched value

NivekT · 2022-09-13T20:54:30Z

torchdata/datapipes/iter/util/extractkeys.py

+                if len(matches) > 1 and self.duplicate_is_error:
+                    raise ValueError(f"extract_keys: multiple sample keys {sample.keys()} match {pattern}.")
+                if matches[0] in used and self.duplicate_is_error:
+                    raise ValueError(f"extract_keys: key {matches[0]} is selected twice.")


Suggested change

-                    raise ValueError(f"extract_keys: key {matches[0]} is selected twice.")
+                    raise ValueError(f"extract_keys: key {matches[0]} is selected twice by multiple patterns.")

nit

NivekT · 2022-09-13T20:55:24Z

torchdata/datapipes/iter/util/extractkeys.py

+@functional_datapipe("extract_keys")
+class ExtractKeysIterDataPipe(IterDataPipe[Dict]):
+    r"""
+    Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.


Suggested change

-    Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.
+    Given a stream of dictionaries, return a stream of dicts (or tuples) by selecting keys using glob patterns.

NivekT · 2022-09-13T20:57:13Z

torchdata/datapipes/iter/util/extractkeys.py

+        >>> dp = FileLister(...).load_from_tar().webdataset().decode(...).extract_keys(["*.jpg", "*.png"], "*.gt.txt")
+    """


In addition to the one example with webdataset, please add an example with sample outputs here. Copying from the test cases is totally fine to me.
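As one possible shape for such an example, here is a plain-Python sketch of the selection semantics discussed in this review. It uses `fnmatch` instead of the actual DataPipe, and the function `extract_keys_sketch` is hypothetical. Its handling of `ignore_missing` (drop the unmatched pattern rather than the whole sample) is an assumption, since the two docstring variants quoted above disagree.

```python
from fnmatch import fnmatch


def extract_keys_sketch(sample, *patterns, duplicate_is_error=True,
                        ignore_missing=False, as_tuple=False):
    """Select one value per pattern; a pattern may be a glob or a list of globs."""
    selected, used = [], set()
    for pattern in patterns:
        alternatives = [pattern] if isinstance(pattern, str) else list(pattern)
        matches = [k for k in sample if any(fnmatch(k, p) for p in alternatives)]
        if not matches:
            if ignore_missing:
                continue  # assumption: tolerate an incomplete output
            raise ValueError(f"no key in {list(sample)} matches {pattern}")
        if len(matches) > 1 and duplicate_is_error:
            raise ValueError(f"multiple keys {matches} match {pattern}")
        key = matches[0]  # otherwise the first matched key wins
        if key in used and duplicate_is_error:
            raise ValueError(f"key {key} is selected twice by multiple patterns")
        used.add(key)
        selected.append((key, sample[key]))
    return tuple(v for _, v in selected) if as_tuple else dict(selected)


sample = {"a.jpg": "image", "a.gt.txt": "label", "a.json": "meta"}
as_dict = extract_keys_sketch(sample, ["*.jpg", "*.png"], "*.gt.txt")
as_tup = extract_keys_sketch(sample, ["*.jpg", "*.png"], "*.gt.txt", as_tuple=True)
```

Here `as_dict` keeps the `a.jpg` and `a.gt.txt` entries, and `as_tup` holds the same two values in pattern order.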

Tom added 2 commits May 13, 2022 12:27

merged

751da99

added extractkeys

5dc2a89

facebook-github-bot added the CLA Signed label May 13, 2022

VitalyFedyunin reviewed May 19, 2022

torchdata/datapipes/iter/util/extractkeys.py

VitalyFedyunin reviewed May 19, 2022

test/test_iterdatapipe.py

VitalyFedyunin reviewed May 19, 2022

test/test_iterdatapipe.py

VitalyFedyunin changed the title ~~extract keys~~ [DataPipe] extract keys May 19, 2022

VitalyFedyunin mentioned this pull request May 20, 2022

[DataPipe] key renamer #402

Open

tmbdev and others added 4 commits August 31, 2022 12:51

added as_tuple option, better testing, duplicate detection

59298b7

Merge branch 'main' into wdsextractkeys

45ae754

fixed type errors

b31d721

improved documentation in extract_keys

ba9b5a4

NivekT reviewed Sep 7, 2022


NivekT reviewed Sep 13, 2022




