Conversation
NivekT
left a comment
There was a problem hiding this comment.
Overall, the implementation makes sense but it feels slightly specific. I will let others chime in on their thoughts. Aside from that, just a few nit comments:
It is worth mentioning that we generally name our DataPipe following this convention:
https://github.com/pytorch/data/blob/9ad73d85b398828a919aef692c27f77182716aa1/docs/source/tutorial.rst#naming
I should add that to the contributing guide.
| output = list(iter(stage2)) | ||
| assert len(output) == 2 | ||
| assert set(output[0].keys()) == set(["t", "b"]) | ||
|
|
There was a problem hiding this comment.
Please test other boolean flags.
|
Please switch the order of inputs 'pattern' -> 'new name' looks more natural |
|
The usual usage is with keyword arguments using a simple key as output and a pattern as input. It also parallels assignment. I think this order is more useful. What do you think? |
|
In my opinion it makes sense to have two datapipes: |
|
|
||
| def __init__( | ||
| self, | ||
| source_datapipe: IterDataPipe[List[Union[Dict, List]]], |
There was a problem hiding this comment.
source_datapipe: IterDataPipe[Dict[Any, Any]]
There was a problem hiding this comment.
This function isn't primarily as a general stream transformation, but as a quick, simple, and readable way of extracting fields for further processing in a data pipeline. That is, usually this is used for getting tar files with different file name patterns and The current API addresses that use case really well. I would recommend keeping it as it is.
NivekT
left a comment
There was a problem hiding this comment.
Overall, it looks pretty good! Just a few comments. Thanks for the PR!
| keep_unselected: keep keys/value pairs even if they don't match any pattern (False) | ||
| must_match: all key value pairs must match (True) | ||
| duplicate_is_error: it is an error if two renamings yield the same key (True) |
There was a problem hiding this comment.
nit: Should we move these after *args?
| WebDatasetIterDataPipe as WebDataset, | ||
| ) | ||
| from torchdata.datapipes.iter.util.renamekeys import ( | ||
| KeyRenamerIterDataPipe as RenameKeys, |
There was a problem hiding this comment.
| KeyRenamerIterDataPipe as RenameKeys, | |
| KeyRenamerIterDataPipe as KeyRenamer, |
| *args, | ||
| keep_unselected=False, | ||
| must_match=True, | ||
| duplicate_is_error=True, |
There was a problem hiding this comment.
| duplicate_is_error=True, | |
| allow_duplicate=False, |
nit: might be a better name but feel free to ignore
| source_datapipe: a DataPipe yielding a stream of dictionaries. | ||
| keep_unselected: keep keys/value pairs even if they don't match any pattern (False) | ||
| must_match: all key value pairs must match (True) | ||
| duplicate_is_error: it is an error if two renamings yield the same key (True) |
There was a problem hiding this comment.
| duplicate_is_error: it is an error if two renamings yield the same key (True) | |
| duplicate_is_error: it is an error if two renamings yield the same key (True); otherwise the first matched one will be returned |
This PR adds a filter that allows keys to be renamed in training samples represented as dictionaries. This is particularly useful for webdataset-style data sets, but can also be used with other dictionary iterators.