Exception handling in online datapipes#968
Exception handling in online datapipes#968SvenDS9 wants to merge 9 commits intometa-pytorch:mainfrom
Conversation
|
While working on this, I noticed a few things: https://github.com/pytorch/data/blob/2ca1fa6483e58c6428319393e1aab4c26f576bec/torchdata/datapipes/iter/load/online.py#L40-L53 In addition we convert the exceptions from |
|
@SvenDS9 Thanks for spotting these!
|
Also add some tests Remove unnecessary code
|
Thanks again for your input!
I have also added a few tests and added query_parameters to |
ejguan
left a comment
There was a problem hiding this comment.
Overall LGTM with one comment
| try: | ||
| parts = urllib.parse.urlparse(url) | ||
|
|
||
| if re.match(r"(drive|docs)[.]google[.]com", parts.netloc): | ||
| yield _get_response_from_google_drive(url, timeout=self.timeout, **self.query_params) | ||
| else: | ||
| yield _get_response_from_http(url, timeout=self.timeout, **self.query_params) |
There was a problem hiding this comment.
Can you please wrap try-except around _get_response_from_google_drive or _get_response_from_http?
In your current implementation, there is a chance the skipped Error comes from parts = urllib.parse.urlparse(url) or re.match(r"(drive|docs)[.]google[.]com", parts.netloc).
There was a problem hiding this comment.
I will import and merge this after this change. Thanks!
There was a problem hiding this comment.
I have implemented this change, but I don't really understand why it is necessary. In my opinion the source of the exception doesn't really matter if we want to skip over them anyway. With this change exceptions caused by trying to parse the url will not be caught.
There was a problem hiding this comment.
With this change exceptions caused by trying to parse the url will not be caught.
I think that is the point. If the URL cannot be parsed, perhaps users want to know and fix it. If you cannot get a response, then they may want to skip it.
| try: | ||
| parts = urllib.parse.urlparse(url) | ||
|
|
||
| if re.match(r"(drive|docs)[.]google[.]com", parts.netloc): | ||
| yield _get_response_from_google_drive(url, timeout=self.timeout, **self.query_params) | ||
| else: | ||
| yield _get_response_from_http(url, timeout=self.timeout, **self.query_params) |
There was a problem hiding this comment.
With this change exceptions caused by trying to parse the url will not be caught.
I think that is the point. If the URL cannot be parsed, perhaps users want to know and fix it. If you cannot get a response, then they may want to skip it.
|
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Fixes #963
Changes
HTTPReaderto skip over urls causing problems