Conversation
ejguan
left a comment
There was a problem hiding this comment.
Thank you! Overall LGTM with a few nit comments
| elem = next(iter(datapipe)) | ||
| assert type(elem) is dict | ||
| assert elem["package_name"] == "com.mantz_it.rfanalyzer" | ||
| mock_load_dataset.assert_called_with( | ||
| path="lhoestq/demo1", streaming=False, split="train", revision="branch", use_auth_token=True | ||
| ) |
There was a problem hiding this comment.
Could you please add one more line to test if there is only one element yielded from the datapipe?
There was a problem hiding this comment.
What exactly do you mean? _get_response_from_huggingface_hub() returns an iterator over the dataset and we look at the first element in line 37-39.
There was a problem hiding this comment.
This test validates the first output is the right one. I want to check a StopIteration should be raised when calling next over the iterator one more time
| self.config_kwargs = config_kwargs | ||
| warnings.warn( | ||
| "default behavior of HuggingFaceHubReader will change in version 0.7", DeprecationWarning, stacklevel=2 | ||
| ) |
There was a problem hiding this comment.
| ) | |
| if "split" not in self.config_kwargs: | |
| warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning) |
| if "split" not in self.config_kwargs: | ||
| warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning) | ||
| if "revision" not in self.config_kwargs: | ||
| warnings.warn("Default value of `revision` will be changed to None in version 0.7", FutureWarning) |
There was a problem hiding this comment.
Looks like the default arguments are changed. @ejguan it will be slightly BC-breaking. WDYT?
There was a problem hiding this comment.
Wait, the default arguments remain the same right?
split="train"
revision="main"
streaming=True
The default arguments are assigned in _get_response_from_huggingface_hub.
There was a problem hiding this comment.
Oh in that case I'm fine with it, only that users will not be able to see those default arguments from IDE autocomplete. which is suboptimal but not a blocker.
Is the warning for Streaming missing or we want it to stay True? The default is False for HG's version.
There was a problem hiding this comment.
Is the warning for Streaming missing or we want it to stay
True?
Good question. I personally like streaming=True to incorporate the style of large-dataset.
|
@SvenDS9 Could you please do a rebase onto main branch? |
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…rch#944) * do not test against HuggingFace production
6b7ec81 to
45536d2
Compare
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
| split: Union[str, datasets.Split] = "train", | ||
| revision: Union[str, datasets.Version] = "main", |
There was a problem hiding this comment.
Let's keep the annotation as str to make sure datasets as optional dependency
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Fixes #944
Changes
load_dataset(from HuggingFace) is called with correct parameters@ejguan could you please have a look