Conversation
| @contextmanager | ||
| def urlopen(*args, **kwargs): | ||
| with closing(_urlopen(*args, **kwargs)) as f: | ||
| with closing(urlopen2(*args, **kwargs)) as f: |
There was a problem hiding this comment.
I'm not sure this workaround is used anywhere.
Maybe we can remove this?
There was a problem hiding this comment.
I would not remove this for now. Investigate in another PR.
There was a problem hiding this comment.
NP. I'll leave it there. Just renamed the urlopen variants to try to make it a little bit clear of what is going on.
| compression = None | ||
| content_encoding = None | ||
| try: | ||
| import requests |
There was a problem hiding this comment.
There is no need to check if requests is installed when using a session object b/c requests is needed to create the object, this is slightly simpler than the approach in #17087 and introduces less code to maintain.
| compression = 'gzip' | ||
| reader = BytesIO(req.read()) | ||
| req.close() | ||
| reader, compression = _urlopen(filepath_or_buffer, session=session) |
There was a problem hiding this comment.
compression is not mentioned in the docs and I wonder if session should.
If so we can use the template/Appender approach, just not sure how to do that in a clean way.
There was a problem hiding this comment.
compressionshould be mentioned in the docs. Would be great to add that.- Given my first point,
sessionshould be as well.
And yes, template / Appender approach sounds like a good idea!
There was a problem hiding this comment.
I need some guidance on how to do this when crossing modules, common.py in get_filepath_or_buffer, and parsers.py for all the parsers defined in _parser_params.
The current situation is:
| options/module-function | get_filepath_or_buffer |
_parser_params |
|---|---|---|
filepath_or_buffer |
present | NA |
mode |
present | NA |
encoding |
present | present |
compression |
missing (added in this PR) | present |
session |
missing | present |
It is unclear to me how to fix this. Some questions I have:
- Should the
compressionandencodingdocstring be standardized? - If yes on the question above what is the best strategy for this? Should
encoding,compression, andsessionbe defined incommon.pyand imported inparsers.pyfor appending in_parser_params? Or the other way around?
My guess is that I should do get the common options from parser.py and add it to common.py like:
_common_params = r"""
Parameters
----------
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
session : requests.Session
object with the a requests session configuration for remote file.
(requires the requests library)
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and
`filepath_or_buffer` is path-like, then detect compression from the
following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
decompression). If using 'zip', the ZIP file must contain only one data
file to be read in. Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
"""Then compose the get_filepath_or_buffer with the missing filepath_or_buffer and mode, and later import _common_params in parsers.py to compose _parser_params. Does that sound OK?
PS: if that is correct I'd prefer if a pandas doc expert did this instead of me, I'm kind of lost in the docstrings and Appenders 😄
pandas/io/common.py
Outdated
| r = session.get(url) | ||
| else: | ||
| r = requests.get(url) | ||
| r.raise_for_status |
There was a problem hiding this comment.
It will raise an exception if there is any issue reading the URL (any non-200 OK status).
There was a problem hiding this comment.
But don't you need to actually call the function? i.e. r.raise_for_status() (with parentheses)
There was a problem hiding this comment.
Yep. I totally forgot that when copying-and-pasting 😄
Codecov Report
@@ Coverage Diff @@
## master #21504 +/- ##
==========================================
- Coverage 92.29% 92.01% -0.29%
==========================================
Files 161 169 +8
Lines 51497 50797 -700
==========================================
- Hits 47530 46740 -790
- Misses 3967 4057 +90
Continue to review full report at Codecov.
|
|
Tested this on a username/password protected CSV endpoint and it's working! import requests
session = requests.sessions.Session()
session.auth = ('Nope', 'NoWay!')
url = 'https://cgoms.coas.oregonstate.edu/erddap/tabledap/CE01ISSM-BUOY-001-GPS.csv'
import pandas as pd
df = pd.read_csv(url, session=session) |
|
Hello @ocefpaf! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 25, 2018 at 15:30 Hours UTC |
|
|
||
| if not isinstance(io, ExcelFile): | ||
| io = ExcelFile(io, engine=engine) | ||
| session = kwds.get('session', None) |
There was a problem hiding this comment.
just list session as a kwarg in read_excel (and in ExcelFile), then just pass it in
| compression = _infer_compression(path_or_buf, compression) | ||
| filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer( | ||
| path_or_buf, encoding=encoding, compression=compression, | ||
| session=session, |
There was a problem hiding this comment.
add a versionadded tag when you add session
| option can improve performance because there is no longer any I/O overhead. | ||
| session : requests.Session | ||
| object with the a requests session configuration for remote file. | ||
| (requires the requests library) |
|
@ocefpaf can you address latest comments and merge in changes from master? |
I'm in transit now but I'll have some time next week to check this. However, if someone wants to take it over please do. I do recall asking about a documenting strategy to avoid the repetition and the comments above. It would seems that to address some of the comments I would touch current docs, and I'm not really comfortable with that. I'll re-visit this soon... |
|
Can you merge master? |
Done! But I'm not sure of I got it right, git rebases are not my forte 😬 BTW I'm happy to iterate on the docs questions in https://github.com/pandas-dev/pandas/pull/21504/files#r196079159 but I'm quite confused where I should start (it is my first time contributing here). |
|
You are better off doing the following on your branch locally: git fetch upstream
git merge upstream/masterThat retains the history of comments on GitHub so makes it easier to track. Looks like you have some test failures with the latest push - if you could take a look would be great |
That is what I did :-/
I'll take a look ASAP. |
Hmm OK. Saw the force push here so that's why I thought you had done something else. Shouldn't need the |
|
@ocefpaf CI is red, can you fix any problem, and push again so there are no tests failures. |
I believe that when I first submitted this PR the |
|
Not sure what the problem is. I'll close this for now, as we can't merge with the conflicts and the CI in red, but if you find a solution for what you said, feel free to reopen. |
Would you be interested in a Python 3 only version of this PR? That is working and I guess that all the Python 2.7 code will be removed soon anyway. |
|
@ocefpaf we drop Py2 support for 0.25 so sure I think Python 3 only is fine for that milestone onwards |
git diff upstream/master -u -- "*.py" | flake8 --diffAlternative to #17087
I'll write the whatsnew entry, tests, and the session option for
html,json, andexcelif this path is OK with the devs.Ping @skynss who is the original author of #17087 and @gfyoung who reviewed it.
Note that the main difference in this approach is that I made the option slightly simpler by allowing only the
sessioninstead ofhttp_params.You can see an example of this in action in:
http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/ocefpaf/d6e9ab2c3569ff8fa181fc7885b6524d/raw/5d868fd4cbe2b81f84ccab7760b80251ef6e4651/pandas_test.ipynb