Huggingface all options #952

Closed
SvenDS9 wants to merge 6 commits into meta-pytorch:main from SvenDS9:huggingface_all_options

Conversation

SvenDS9 (Contributor) commented Jan 19, 2023

Fixes #944

Changes

  • Changed the test setup for HuggingFaceHubReader: the tests no longer run against the production HuggingFace Hub, and instead verify that load_dataset (from HuggingFace) is called with the correct parameters
  • Include HuggingFaceHubReader in documentation

@ejguan could you please have a look?

facebook-github-bot added the CLA Signed label Jan 19, 2023

ejguan (Contributor) left a comment

Thank you! Overall LGTM with a few nit comments

Comment on lines 37 to +44
elem = next(iter(datapipe))
assert type(elem) is dict
assert elem["package_name"] == "com.mantz_it.rfanalyzer"
mock_load_dataset.assert_called_with(
    path="lhoestq/demo1", streaming=False, split="train", revision="branch", use_auth_token=True
)
Contributor

Could you please add one more line to test if there is only one element yielded from the datapipe?

Contributor Author

What exactly do you mean? _get_response_from_huggingface_hub() returns an iterator over the dataset, and we look at the first element in lines 37-39.

Contributor

This test validates that the first output is the right one. I also want to check that StopIteration is raised when next is called on the iterator one more time.
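
The check requested here can be sketched as follows. This is a minimal, self-contained illustration rather than the test from this PR: `mock_load_dataset` stands in for the patched HuggingFace `load_dataset`, and `read_rows` is a hypothetical reader, not the actual `HuggingFaceHubReader`.

```python
from unittest.mock import MagicMock

# Stand-in for the patched `load_dataset`: returns a dataset with exactly one row.
mock_load_dataset = MagicMock(return_value=[{"package_name": "com.mantz_it.rfanalyzer"}])


def read_rows(**config_kwargs):
    """Hypothetical reader: forwards its kwargs to the loader and yields each row."""
    yield from mock_load_dataset(**config_kwargs)


it = iter(read_rows(path="lhoestq/demo1", split="train"))

# The first element is the expected row.
first = next(it)
assert first["package_name"] == "com.mantz_it.rfanalyzer"

# Calling next() once more must raise StopIteration, i.e. only one element is yielded.
exhausted = False
try:
    next(it)
except StopIteration:
    exhausted = True
assert exhausted

# The loader was called with exactly the forwarded parameters.
mock_load_dataset.assert_called_with(path="lhoestq/demo1", split="train")
```

In a `unittest.TestCase`, the try/except part would typically be written as `with self.assertRaises(StopIteration): next(it)`.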

SvenDS9 added a commit to SvenDS9/PytorchData that referenced this pull request Jan 23, 2023

self.config_kwargs = config_kwargs
warnings.warn(
    "default behavior of HuggingFaceHubReader will change in version 0.7", DeprecationWarning, stacklevel=2
)
ejguan (Contributor) commented Jan 23, 2023

Suggested change
)
if "split" not in self.config_kwargs:
    warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning)

ejguan (Contributor) left a comment

LGTM. @NivekT, do you want to chime in with any concerns?

Comment on lines +78 to +81
if "split" not in self.config_kwargs:
    warnings.warn("Default value of `split` will be changed to None in version 0.7", FutureWarning)
if "revision" not in self.config_kwargs:
    warnings.warn("Default value of `revision` will be changed to None in version 0.7", FutureWarning)
Contributor

Looks like the default arguments are changed. @ejguan it will be slightly BC-breaking. WDYT?

Contributor

Wait, the default arguments remain the same, right?
split="train"
revision="main"
streaming=True

The default arguments are assigned in _get_response_from_huggingface_hub.
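
A rough sketch of the pattern described here, where defaults are filled in inside a helper and a FutureWarning is emitted for any argument whose default is slated to change. The function name `resolve_config` is hypothetical and this is not the code merged in the PR; only the default values and warning messages are taken from this thread.

```python
import warnings


def resolve_config(**config_kwargs):
    """Hypothetical helper: warn about defaults that will change, then apply the current ones."""
    for name in ("split", "revision"):
        if name not in config_kwargs:
            warnings.warn(
                f"Default value of `{name}` will be changed to None in version 0.7",
                FutureWarning,
                stacklevel=2,
            )
    # Current defaults, as listed in the comment above.
    config_kwargs.setdefault("split", "train")
    config_kwargs.setdefault("revision", "main")
    config_kwargs.setdefault("streaming", True)
    return config_kwargs


# Only `revision` is missing here, so exactly one FutureWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_config(split="test")
```

Because the defaults live in the function body rather than in the signature, a caller who passes `split` and `revision` explicitly sees no warning and keeps full control over the values.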

Contributor

Oh, in that case I'm fine with it. The only downside is that users will not be able to see those default arguments in IDE autocomplete, which is suboptimal but not a blocker.
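
The autocomplete trade-off can be illustrated with `inspect.signature`, which reports roughly what an IDE would show. Both functions below are hypothetical stand-ins, not the actual `HuggingFaceHubReader` API: one declares the defaults in its signature, the other accepts only `**config_kwargs` and applies the same defaults internally.

```python
import inspect


def reader_explicit(dataset, *, split="train", revision="main", streaming=True, **config_kwargs):
    """Hypothetical variant: defaults are part of the signature."""
    return {"path": dataset, "split": split, "revision": revision, "streaming": streaming, **config_kwargs}


def reader_kwargs_only(dataset, **config_kwargs):
    """Hypothetical variant: defaults are applied in the body and hidden from the signature."""
    return {"path": dataset, "split": "train", "revision": "main", "streaming": True, **config_kwargs}


explicit_sig = str(inspect.signature(reader_explicit))
kwargs_sig = str(inspect.signature(reader_kwargs_only))

print(explicit_sig)  # defaults for split, revision and streaming are visible
print(kwargs_sig)    # only (dataset, **config_kwargs) is visible
```

Both variants behave identically at runtime; the difference is only in what tooling can discover without reading the implementation or the docstring.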

Is the warning for streaming missing, or do we want it to stay True? The default is False in HuggingFace's version.

Contributor

Is the warning for Streaming missing or we want it to stay True?

Good question. I personally prefer keeping streaming=True, since it fits the streaming style used for large datasets.

ejguan (Contributor) commented Jan 23, 2023

@SvenDS9 Could you please rebase onto the main branch?

facebook-github-bot (Contributor) commented

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan force-pushed the huggingface_all_options branch from 6b7ec81 to 45536d2 on January 24, 2023 15:59

Comment on lines +20 to +21
split: Union[str, datasets.Split] = "train",
revision: Union[str, datasets.Version] = "main",
Contributor

Let's keep the annotation as str so that datasets remains an optional dependency.
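
One way to read this suggestion: if the signature refers to `datasets.Split` or `datasets.Version`, the module has to import `datasets` at import time, which makes it a hard dependency. A minimal sketch of the alternative follows, with plain `str` annotations and a lazy import. The function `load_from_hub` is hypothetical and is not the code in this PR; `datasets.load_dataset` is the real HuggingFace entry point and is only reached if the function is actually called.

```python
from typing import Any


def load_from_hub(path: str, split: str = "train", revision: str = "main") -> Any:
    """Hypothetical loader whose signature needs nothing from `datasets`."""
    try:
        # Imported lazily so that importing this module works without `datasets` installed.
        import datasets
    except ImportError as exc:
        raise ModuleNotFoundError(
            "Package `datasets` is required to load from the HuggingFace Hub."
        ) from exc
    return datasets.load_dataset(path=path, split=split, revision=revision)
```

With this layout, users who never call the loader do not need `datasets` installed, and those who do get a clear error message if it is missing.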

facebook-github-bot (Contributor) commented

@ejguan merged this pull request in f7242a4.

SvenDS9 deleted the huggingface_all_options branch on February 15, 2023 15:01

Labels

CLA Signed, Merged

Development

Successfully merging this pull request may close these issues.

Full Support for HuggingFace-Datasets

4 participants