Skip to content

Draft for a safer HuggingFaceDatasetSaver#3835

Closed
Wauplin wants to merge 22 commits into
gradio-app:mainfrom
Wauplin:safer-hf-dataset-saver
Closed

Draft for a safer HuggingFaceDatasetSaver#3835
Wauplin wants to merge 22 commits into
gradio-app:mainfrom
Wauplin:safer-hf-dataset-saver

Conversation

@Wauplin

@Wauplin Wauplin commented Apr 12, 2023

Copy link
Copy Markdown
Contributor

Description

This PR should fix a few problems with the HF dataset saver callback:

(For the 2 issues in huggingface_hub, I wanted to transfer them to Gradio but I don't have the permissions. Feel free to do it if you can @abidlabs)

Drawbacks:

  • At the moment if multiple Spaces are pushing flagged data to the same repo, it will not work properly (some data might be erased). This is fixable, yet not straightforward.
  • Might be less efficient if the CSV file gets very large. When data is appended, the full file is uploaded. Shouldn't be a problem in most cases running on Spaces.
  • Requires huggingface_hub>=0.12.0 - but shouldn't be a problem.

At the moment I implemented it as a draft for a conceptual approval @abidlabs @freddyaboulton. It is in its standalone SaferHuggingFaceDatasetSaver (as the signature changed a little bit). Please let me know if you are opinionated on how things should be especially because there is already a HuggingFaceDatasetSaver and HuggingFaceDatasetJSONSaver that seems to have quite a lot of code in common. I wanted to clean some stuff there but didn't know if it was appropriate.

Example:

import gradio as gr

hf_writer = gr.SaferHuggingFaceDatasetSaver(HF_API_TOKEN, "image-classification-mistakes")

def image_classifier(inp):
    return {'cat': 0.3, 'dog': 0.7}

gr.Interface(fn=image_classifier, inputs="image", outputs="label", allow_flagging="manual", flagging_callback=hf_writer).launch()

Checklist:

  • I have performed a self-review of my own code
  • I have added a short summary of my change to the CHANGELOG.md
  • My code follows the style guidelines of this project
  • I have commented my code in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@gradio-pr-bot

Copy link
Copy Markdown
Collaborator

All the demos for this PR have been deployed at https://huggingface.co/spaces/gradio-pr-deploys/pr-3835-all-demos

@abidlabs

Copy link
Copy Markdown
Member

Hi @Wauplin thanks for making this PR and apologies for the delay in reviewing. The suggested changes to make the flagging method thread-safe look great!

I would indeed replace the contents of the HuggingFaceDatasetSaver class with the SaferHuggingFaceDatasetSaver class, since we've had issues around thread-safety in the original class. For backwards compatibility, it would be good to add the organization parameter back in (we can make it have no effect) as well as verbose (also no effect). And it would be great if we can apply the same logic to the HuggingFaceDatasetJSONSaver class as well.

Let me know if you'd like me to make these changes, and then happy to review/merge!

@Wauplin

Wauplin commented Apr 17, 2023

Copy link
Copy Markdown
Contributor Author

Good! I'll make the changes and let know you when it's ready for a real review :)

@Wauplin

Wauplin commented Apr 18, 2023

Copy link
Copy Markdown
Contributor Author

@abidlabs I've made the mentioned changes (replace HuggingFaceDatasetSaver, keep same signature with ignore organization).

About the HuggingFaceDatasetJSONSaver, I'm fine with updating it as well with the same logic. But if I'm updating it I might as well refactor it a bit. I do think the naming/implementation is a bit clunky for this one. The name HuggingFaceDatasetJSONSaver implicitly tells that the data will be saved as JSON (instead of CSV). However if I understood correctly the main feature doesn't lie on the file format but on the fact that it's far more thread-safe, using 1 folder per flagged item.

I ran it locally to get an idea of the file structure and got this:

flagged
├── image-classification-mistakes-default-saver
│   ├── data.csv
│   ├── dataset_info.json
│   ├── inp
│   │   ├── tmp08x0gpnm.png
│   │   ├── tmp0zxwfkrx.png
│   └── output
│       ├── tmp63q0d18b.json
│       ├── tmpeyrzxgqy.json
└── image-classification-mistakes-json-saver
    ├── dataset_infos.json
    ├── e8dfcdcd-0991-4296-85bb-9b44e98cf531
    │   ├── inp
    │   │   └── tmp4fs_tu9l.png
    │   ├── metadata.jsonl
    │   └── output
    │       └── tmptcp5nwuk.json
    └── eb193af0-e5a2-4fd1-a279-6b1f4a06e92b
        ├── inp
        │   └── tmpl6omqocc.png
        ├── metadata.jsonl
        └── output
            └── tmpfmhcb354.json
  1. Ideally metadata.jsonl files should be renamed metadata.json as it stores in json format and not jsonlines. Maybe even rename data.json to be consistent with the data.csv of the default saver.
  2. Since most of the code is duplicated, I would ideally define a single HuggingFaceDatasetSaver class with an extra-parameter distributed:bool=False (I'm open for better name suggestions :D). If False, it stays with the default structure. If True, we use the "1 folder per item" structure.
  3. If we implement (2.) I can still define HuggingFaceDatasetJSONSaver for backward compatibility which would be an alias for HuggingFaceDatasetSaver (+add a deprecation warning on it).

@abidlabs What's your opinion on that? I don't want to overstep too much so I'm fine with keeping as it is but if you don't mind I'd prefer to clean this (i.e. rename jsonl files + deduplicate code) :)

@abidlabs

abidlabs commented Apr 18, 2023

Copy link
Copy Markdown
Member

Thanks @Wauplin for the changes and for this opportunity to improve the class! Regarding the points that you brought up:

Ideally metadata.jsonl files should be renamed metadata.json as it stores in json format and not jsonlines. Maybe even rename data.json to be consistent with the data.csv of the default saver.

The purpose of calling it metadata.jsonl is actually to follow the format that the Hugging Face Dataset Previewer expects:

Metadata can also be specified as JSON Lines, in which case use metadata.jsonl as the name of the metadata file. This format is helpful in scenarios when one of the columns is complex, e.g. a list of floats, to avoid parsing errors or reading the complex values as strings.

Giving it another name would prevent the flagged data from showing up automatically in the Dataset Previewer UI on Hugging Face Datasets, so I would keep it with this name.

Since most of the code is duplicated, I would ideally define a single HuggingFaceDatasetSaver class with an extra-parameter distributed:bool=False (I'm open for better name suggestions :D). If False, it stays with the default structure. If True, we use the "1 folder per item" structure.

If we implement (2.) I can still define HuggingFaceDatasetJSONSaver for backward compatibility which would be an alias for HuggingFaceDatasetSaver (+add a deprecation warning on it).

Yup most of the code is duplicated, and it makes more sense to just have this as a parameter in HuggingFaceDatasetSaver while keeping the HuggingFaceDatasetJSONSaver class as an alias for backwards compatbility. There are two changes that the HuggingFaceDatasetJSONSaver implements: the metadata is in jsonl format, and it's 1 directory per flagged item. I agree with you that the more important change is the latter. How about a boolean separate_dir parameter in HuggingFaceDatasetSaver?

Thanks again @Wauplin!

@Wauplin

Wauplin commented Apr 18, 2023

Copy link
Copy Markdown
Contributor Author

Thanks for the precision @abidlabs. Then let's keep a jsonl format, it's a pretty good reason! I'll add a comment about it in the code.

How about a boolean separate_dir parameter in HuggingFaceDatasetSaver?

Yep, let's go for separate_dir :)

@abidlabs

Copy link
Copy Markdown
Member

Thanks @Wauplin, I'll go ahead and make those final changes so that we can merge this in

@Wauplin

Wauplin commented Apr 24, 2023

Copy link
Copy Markdown
Contributor Author

@abidlabs sry for not being responding last week. I've already worked on making those changes (I have some uncommitted changes). I'll keep you update soon (tomorrow?)

@abidlabs

Copy link
Copy Markdown
Member

Oh sorry @Wauplin my bad, I completely missed this message and just pushed my own changes!

@abidlabs

Copy link
Copy Markdown
Member

Don't worry about it, I'll fix up the issues as we discussed above, but I'll keep the PR open so you can also take a look and make sure it looks good to you too

@abidlabs

Copy link
Copy Markdown
Member

Should be ready now @Wauplin! I'll hold off merging until tomorrow so that you have a chance to take a look as well.

@abidlabs

Copy link
Copy Markdown
Member

Will plan to merge in 3-4 hours unless you have feedback @Wauplin!

@Wauplin

Wauplin commented Apr 25, 2023

Copy link
Copy Markdown
Contributor Author

@abidlabs I pushed my ongoing work to a separate branch and made a second PR (#3973) for it. Sorry for the duplicated work on this PR, I didn't know it was urgent to merge it.
The other PR is, IMHO, cleaner in the sense that there is less duplicated logic between the 2 classes. As mentioned in #3835 (comment) I'm not sure it's needed to have 2 different classes that does almost the same.

But in any case, I let you merge it if you want. In the end both are working great :)

Comment thread gradio/flagging.py
@abidlabs

Copy link
Copy Markdown
Member

Reviewing both PRs now!

@abidlabs I pushed my ongoing work to a separate branch and made a second PR (#3973) for it. Sorry for the duplicated work on this PR, I didn't know it was urgent to merge it.

No my bad! There wasn't any real urgency to merge this PR but I just wanted to clean up the open PRs and didn't realize you were already working on it.

abidlabs and others added 2 commits April 26, 2023 01:09
Co-authored-by: Lucain <lucainp@gmail.com>
@abidlabs

Copy link
Copy Markdown
Member

Closing in favor of #3973

@abidlabs abidlabs closed this Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants