-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-2066: [Python] Document using pyarrow with Azure Blob Store #1544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-2066: [Python] Document using pyarrow with Azure Blob Store #1544
Conversation
rjrussell77
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wesm Can you review this? (I had difficulty with getting the formatting down. Adding a sub-bullet list item caused the parent bullet to become italicized. Eventually omitted the sub-bullet.)
|
@xhochy Uwe - can you review? |
python/doc/source/parquet.rst
Outdated
| block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key) | ||
| with tempfile.TemporaryFile() as fp: | ||
| block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=fp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should actually work without a temporary file, see the implementation in simplekv: https://github.com/mbr/simplekv/blob/master/simplekv/net/azurestore.py#L74
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - I'll take a look. Thanks.
python/doc/source/parquet.rst
Outdated
| byte_stream = io.BytesIO() | ||
| block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream) | ||
| pd = pq.read_table(source=byte_stream).to_pandas() | ||
| pd.head(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy Good feedback - I replaced the temp file buffer with BytesIO stream instead.
| print("Error: {0}".format(err)) | ||
| finally: | ||
| byte_stream.close() | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added try/except/finally block to ensure closure of the stream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this comment to the code? That will also be helpful for the reader later.
python/doc/source/parquet.rst
Outdated
| block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key) | ||
| try: | ||
| block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream) | ||
| pd = pq.read_table(source=byte_stream).to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result is typically written into a variable called df whereas pd is the abbreviation you use when you import pandas (import pandas as pd)
python/doc/source/parquet.rst
Outdated
| try: | ||
| block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream) | ||
| pd = pq.read_table(source=byte_stream).to_pandas() | ||
| pd.head(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better replace this with # Do work on DF …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about, # Do work on df (lower case)?
python/doc/source/parquet.rst
Outdated
| block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream) | ||
| pd = pq.read_table(source=byte_stream).to_pandas() | ||
| pd.head(10) | ||
| except Exception as err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't catch exceptions like this, just let it throw.
| print("Error: {0}".format(err)) | ||
| finally: | ||
| byte_stream.close() | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this comment to the code? That will also be helpful for the reader later.
…emove head() call and instead use comment to indicate generic fill-in code, add comment re: stream closure in finally block
| finally: | ||
| # Add finally block to ensure closure of the stream | ||
| byte_stream.close() | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy Ok, I've responded to your last set of feedback. How are we looking now?
xhochy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you for writing this up. This will be very helpful for new users.
|
@rjrussell77 do you have an Apache JIRA id so I could assign https://issues.apache.org/jira/browse/ARROW-2066 to you? |
|
@xhochy Regarding your question about my Apache JIRA id - try this: rob_PTL |
|
Thanks @rjrussell77 -- I added you as a contributor and assigned the issue to you |
Original question:
#1510
Improvement story:
https://issues.apache.org/jira/browse/ARROW-2066