Update run_glue for do_predict with local test data (#9442)#9486
Conversation
|
Error messages of the CircleCI are: I'm sorry but I'd like to ask you if |
sgugger
left a comment
There was a problem hiding this comment.
The failures seems spurious indeed. I've left some comment to move the test dataset creation a bit, but it overall looks good to me.
| if data_args.task_name is None and data_args.test_file is not None: | ||
| extension = data_args.test_file.split(".")[-1] | ||
| assert extension in ["csv", "json"], "`test_file` should be a csv or a json file." | ||
| if data_args.test_file.endswith(".csv"): | ||
| # Loading a dataset from a local csv file | ||
| test_dataset = load_dataset("csv", data_files={"test": data_args.test_file}) | ||
| else: | ||
| # Loading a dataset from a local json file | ||
| test_dataset = load_dataset("json", data_files={"test": data_args.test_file}) |
There was a problem hiding this comment.
Can we put those lines earlier, with the validation dataset? This way the map will be done with the other dataset. I think we can do something nice by creating data_files={"train": data_args.train_file, "validation": data_args.validation_file} and then adding the keys test if the test_file is passed.
There was a problem hiding this comment.
Thank you for your comment! I've reflected the review.
The nested if statements in the code have increased, but I think the readability may have improved in terms of "whether to use GLUE task or to use local files". What do you think?
There was a problem hiding this comment.
I also added the logger output to make sure that the local files a user wants to use are loaded correctly. If this is superfluous, please let me know and I will remove it.
sgugger
left a comment
There was a problem hiding this comment.
Looking good! Left two more small comments and it should be good to merge!
| if training_args.do_predict: | ||
| if data_args.test_file is not None: | ||
| extension = data_args.test_file.split(".")[-1] | ||
| assert extension in ["csv", "json"], "`test_file` should be a csv or a json file." |
There was a problem hiding this comment.
The extension will need to be the same one as for the training and validation file, so we should adapt this assert to test that.
There was a problem hiding this comment.
Reflecting the comments, assert now checks that the test file has the same extension as the train file.
Also, I thought there was no check if the validation file has the same extension as the train file, so I modified that. Is this change OK?
| datasets = load_dataset( | ||
| "json", data_files=data_files | ||
| ) |
There was a problem hiding this comment.
| datasets = load_dataset( | |
| "json", data_files=data_files | |
| ) | |
| datasets = load_dataset("json", data_files=data_files) |
Can fit in one line now :-)
There was a problem hiding this comment.
Thank you!
It may be that the old code before applying the auto-format was left in place.
I have applied the auto-format in b2936c3, could you please check if it is fit in one line?
| for key in data_files.keys(): | ||
| logger.info(f"load a local file for {key}: {data_files[key]}") |
There was a problem hiding this comment.
This could info to log, thanks for adding!
LysandreJik
left a comment
There was a problem hiding this comment.
Great, thanks for adding!
|
@sgugger @LysandreJik |
What does this PR do?
Currently, run_glue.py cannot use the test set (
do_predict) unless we give it a GLUE task name.This PR will allow us to use the local test dataset.
As commented in #9442, I tried to achieve the functionality with only simple changes.
--do_predictwith out adding specific params, we will get an error statement saying that we need either the GLUE task name or the path of the local test file.Fixes #9442
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@sgugger
Thank you for your kind comments on the issue.
I have tried to keep it simple and hope there is no problem as an example script.