Conversation
VitalyFedyunin
left a comment
There was a problem hiding this comment.
This DataPipe looks useful for other cases as well if me modify it a bit.
Could you please separate it by two pipes:
- 1st one to filter dict keys based on pattern:
stage1 = IterableWrapper([
{"1.txt": "1", "1.bin": "1b", "3.jpg":"foo"},
{"2.txt": "2", "2.bin": "2b"},
])
stage2 = ExtractKeys(stage1, "*.txt", "*.bin")
output = list(iter(stage2))
self.assertEqual({"1.txt": "1", "1.bin": "1b"}, output[0])- Second is simple map, to drop keys:
dp = dp.map(lambda x: x.values())
NivekT
left a comment
There was a problem hiding this comment.
Hi @tmbdev,
Thanks for your commits on these PRs. Let us know if these are ready for review (but no rush at all!). @VitalyFedyunin and I will be happy to have a look.
Again, thanks for contributing to our library!
|
|
||
|
|
||
| @functional_datapipe("extract_keys") | ||
| class ExtractKeysIterDataPipe(IterDataPipe[Dict]): |
There was a problem hiding this comment.
Can we rename this to KeyExtractor to follow our naming convention? Thanks.
We can still keep "extract_keys" as the functional name.
| with self.assertRaises(TypeError): | ||
| len(output_dp) | ||
|
|
||
| def test_extractor(self): |
There was a problem hiding this comment.
| def test_extractor(self): | |
| def test_key_extractor(self): |
nit: We used to have a different extractor
| duplicate_is_error: it is an error if the same key is selected twice (True) | ||
| ignore_missing: skip any dictionaries where one or more patterns don't match (False) |
There was a problem hiding this comment.
| duplicate_is_error: it is an error if the same key is selected twice (True) | |
| ignore_missing: skip any dictionaries where one or more patterns don't match (False) |
Duplicate lines of descriptions
| *args: list of glob patterns or list of glob patterns | ||
| duplicate_is_error: it is an error if the same key is selected twice (True) | ||
| ignore_missing: allow patterns not to match (i.e., incomplete outputs) | ||
| as_tuple: return a tuple instead of a dictionary |
There was a problem hiding this comment.
| as_tuple: return a tuple instead of a dictionary | |
| as_tuple: return a tuple instead of a dictionary (True or False here) |
| """ | ||
|
|
||
| def __init__( | ||
| self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False |
There was a problem hiding this comment.
Do we want to default as_tuple=False? Based on the docstring I would've guessed you wanted True instead.
| self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False | |
| self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error: bool = True, ignore_missing: bool = False, as_tuple: bool = False |
There was a problem hiding this comment.
nit: allow_duplicate might be a better name than duplicate_is_error
| def __len__(self) -> int: | ||
| return len(self.source_datapipe) |
There was a problem hiding this comment.
Question: A sample will always be yielded even if nothing matches right?
| duplicate_is_error: it is an error if the same key is selected twice (True) | ||
| ignore_missing: skip any dictionaries where one or more patterns don't match (False) | ||
| *args: list of glob patterns or list of glob patterns | ||
| duplicate_is_error: it is an error if the same key is selected twice (True) |
There was a problem hiding this comment.
| duplicate_is_error: it is an error if the same key is selected twice (True) | |
| duplicate_is_error: it is an error if the same key is selected twice (True), otherwise returns the first matched value |
| if len(matches) > 1 and self.duplicate_is_error: | ||
| raise ValueError(f"extract_keys: multiple sample keys {sample.keys()} match {pattern}.") | ||
| if matches[0] in used and self.duplicate_is_error: | ||
| raise ValueError(f"extract_keys: key {matches[0]} is selected twice.") |
There was a problem hiding this comment.
| raise ValueError(f"extract_keys: key {matches[0]} is selected twice.") | |
| raise ValueError(f"extract_keys: key {matches[0]} is selected twice by multiple patterns.") |
nit
| @functional_datapipe("extract_keys") | ||
| class ExtractKeysIterDataPipe(IterDataPipe[Dict]): | ||
| r""" | ||
| Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns. |
There was a problem hiding this comment.
| Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns. | |
| Given a stream of dictionaries, return a stream of dicts (or tuples) by selecting keys using glob patterns. |
| >>> dp = FileLister(...).load_from_tar().webdataset().decode(...).extract_keys(["*.jpg", "*.png"], "*.gt.txt") | ||
| """ |
There was a problem hiding this comment.
In addition to the one example with webdataset, please add an example with sample outputs here. Copying from the test cases is totally fine to me.
This PR adds an ExtractKeys filter that turns samples represented as dictionaries into tuples. Tuples are constructed by selecting values from the dictionaries by matching the key against a given set of patterns.