Draft for a safer HuggingFaceDatasetSaver#3835
Conversation
|
All the demos for this PR have been deployed at https://huggingface.co/spaces/gradio-pr-deploys/pr-3835-all-demos |
|
Hi @Wauplin thanks for making this PR and apologies for the delay in reviewing. The suggested changes to make the flagging method thread-safe look great! I would indeed replace the contents of the Let me know if you'd like me to make these changes, and then happy to review/merge! |
|
Good! I'll make the changes and let know you when it's ready for a real review :) |
|
@abidlabs I've made the mentioned changes (replace About the I ran it locally to get an idea of the file structure and got this:
@abidlabs What's your opinion on that? I don't want to overstep too much so I'm fine with keeping as it is but if you don't mind I'd prefer to clean this (i.e. rename jsonl files + deduplicate code) :) |
|
Thanks @Wauplin for the changes and for this opportunity to improve the class! Regarding the points that you brought up:
The purpose of calling it
Giving it another name would prevent the flagged data from showing up automatically in the Dataset Previewer UI on Hugging Face Datasets, so I would keep it with this name.
Yup most of the code is duplicated, and it makes more sense to just have this as a parameter in Thanks again @Wauplin! |
|
Thanks for the precision @abidlabs. Then let's keep a jsonl format, it's a pretty good reason! I'll add a comment about it in the code.
Yep, let's go for |
|
Thanks @Wauplin, I'll go ahead and make those final changes so that we can merge this in |
|
@abidlabs sry for not being responding last week. I've already worked on making those changes (I have some uncommitted changes). I'll keep you update soon (tomorrow?) |
|
Oh sorry @Wauplin my bad, I completely missed this message and just pushed my own changes! |
|
Don't worry about it, I'll fix up the issues as we discussed above, but I'll keep the PR open so you can also take a look and make sure it looks good to you too |
|
Should be ready now @Wauplin! I'll hold off merging until tomorrow so that you have a chance to take a look as well. |
|
Will plan to merge in 3-4 hours unless you have feedback @Wauplin! |
|
@abidlabs I pushed my ongoing work to a separate branch and made a second PR (#3973) for it. Sorry for the duplicated work on this PR, I didn't know it was urgent to merge it. But in any case, I let you merge it if you want. In the end both are working great :) |
|
Reviewing both PRs now!
No my bad! There wasn't any real urgency to merge this PR but I just wanted to clean up the open PRs and didn't realize you were already working on it. |
Co-authored-by: Lucain <lucainp@gmail.com>
|
Closing in favor of #3973 |
Description
This PR should fix a few problems with the HF dataset saver callback:
hf_tokenshould be optional (not needed if already logged in).(For the 2 issues in
huggingface_hub, I wanted to transfer them to Gradio but I don't have the permissions. Feel free to do it if you can @abidlabs)Drawbacks:
huggingface_hub>=0.12.0- but shouldn't be a problem.At the moment I implemented it as a draft for a conceptual approval @abidlabs @freddyaboulton. It is in its standalone
SaferHuggingFaceDatasetSaver(as the signature changed a little bit). Please let me know if you are opinionated on how things should be especially because there is already aHuggingFaceDatasetSaverandHuggingFaceDatasetJSONSaverthat seems to have quite a lot of code in common. I wanted to clean some stuff there but didn't know if it was appropriate.Example:
Checklist: