Read partitioned parquet files #632

Merged
devin-petersohn merged 12 commits into modin-project:master from williamma12:io/parquet_partitions on May 28, 2019

@williamma12 williamma12 commented May 23, 2019

What do these changes do?

Adds a special case for reading partitioned parquet files.

Related issue number

Fixes #624 and #633

  • passes flake8 modin
  • passes black --check modin
  • tests added and passing
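
For context on the special case described above: a partitioned parquet dataset is usually a directory tree rather than a single file, with Hive-style `key=value` subdirectories encoding the partition columns. The sketch below is not the code in this PR. It only illustrates, using the standard library, how such a layout can be walked to recover partition values. The function names, the `year`/`month` layout, and the empty placeholder files are all hypothetical.

```python
import tempfile
from pathlib import Path


def discover_partitions(root):
    """Return (relative file path, partition values) for each *.parquet file under root."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*.parquet")):
        relative = path.relative_to(root)
        values = {}
        # Every "key=value" directory between root and the file names one partition column.
        for part in relative.parts[:-1]:
            key, sep, value = part.partition("=")
            if sep:
                values[key] = value
        found.append((relative.as_posix(), values))
    return found


def make_demo_dataset(root):
    """Create a hypothetical partitioned layout containing empty placeholder files."""
    for year, month in [("2018", "12"), ("2019", "05")]:
        leaf = Path(root) / f"year={year}" / f"month={month}"
        leaf.mkdir(parents=True)
        (leaf / "part-0.parquet").touch()


with tempfile.TemporaryDirectory() as tmp:
    make_demo_dataset(tmp)
    demo = discover_partitions(tmp)

for relative_path, partition_values in demo:
    print(relative_path, partition_values)
```

A reader that only accepts a single file path fails on a layout like this, which is why a directory needs separate handling.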

codecov bot commented May 23, 2019

Codecov Report

Merging #632 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #632      +/-   ##
==========================================
+ Coverage   90.49%   90.52%   +0.03%     
==========================================
  Files          37       37              
  Lines        5563     5584      +21     
==========================================
+ Hits         5034     5055      +21     
  Misses        529      529

Impacted Files                     Coverage Δ
modin/engines/ray/generic/io.py    94.4% <100%> (+0.39%) ⬆️
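
The percentages in the coverage diff can be checked from the hit and line counts. This is a small sketch that assumes coverage is computed as hits divided by lines and that the report truncates to two decimals. The truncation is an inference from the numbers, not something the report states; rounding would give 90.53% for the new figure.

```python
import math


def coverage_percent(hits, lines):
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


def truncate(value, digits=2):
    """Drop (rather than round) everything past the given number of decimals."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale


# Counts taken from the coverage diff: master first, then this PR.
before = truncate(coverage_percent(5034, 5563))
after = truncate(coverage_percent(5055, 5584))
delta = round(after - before, 2)

print(before, after, delta)
```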

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 484d36c...830cae2.

ddutt commented May 23, 2019

Hi @williamma12,

When I applied your fixes to my 0.5.0 tree, I got a different error (I've turned off memory-map):

RayTaskError: ray_worker (pid=31117, host=ddutt-yoga)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/engines/ray/pandas_on_ray/io.py", line 55, in _read_parquet_columns
    df = pq.read_pandas(path, columns=columns, **kwargs).to_pandas()
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 1144, in read_pandas
    use_pandas_metadata=True)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 1123, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/filesystem.py", line 181, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 985, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 535, in read
    table = reader.read(**options)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 211, in read
    columns, use_pandas_metadata=use_pandas_metadata)
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 261, in _get_column_indices
    for descr in index_columns
  File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 262, in
    if descr['kind'] == 'serialized']
TypeError: string indices must be integers
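
The last frame of the traceback shows `descr['kind']` raising the error, which means at least one `descr` was a plain string rather than a dict. The sketch below reproduces that failure mode in isolation. It is not pyarrow's code: the function names, the `field_name` key, and the sample data are hypothetical, and why the index descriptors were strings in this environment is not established here.

```python
def serialized_index_names(index_columns):
    """Mirror the failing pattern: assumes every descriptor is a dict."""
    return [
        descr["field_name"]  # hypothetical key, for illustration only
        for descr in index_columns
        if descr["kind"] == "serialized"
    ]


def tolerant_index_names(index_columns):
    """Accept both plain-string descriptors and dict descriptors."""
    names = []
    for descr in index_columns:
        if isinstance(descr, str):
            names.append(descr)
        elif descr.get("kind") == "serialized":
            names.append(descr["field_name"])
    return names


# Illustrative data: a descriptor list that holds a bare column name.
string_descriptors = ["__index_level_0__"]

try:
    serialized_index_names(string_descriptors)
    error_message = None
except TypeError as exc:
    # Indexing a str with a str key raises the TypeError seen in the traceback.
    error_message = str(exc)

print(error_message)
print(tolerant_index_names(string_descriptors))
```

Checking the descriptor type before indexing, as in the second function, is one way to make such a loop tolerate both shapes.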

@devin-petersohn devin-petersohn left a comment

I have a lot of questions after taking a quick look. Let's chat about this.

Collaborator

Just delete this. It will be fixed when the resolution for #636 is added.

Collaborator Author

Sounds good! This will be blocked on #638.

@williamma12 williamma12 added the Blocked ❌ A pull request that is blocked label May 24, 2019
@williamma12
Collaborator Author

Blocked on #638

@devin-petersohn devin-petersohn removed the Blocked ❌ A pull request that is blocked label May 27, 2019
@williamma12 williamma12 force-pushed the io/parquet_partitions branch from be63e5f to 830cae2 Compare May 27, 2019 22:42

@devin-petersohn devin-petersohn left a comment

Looks great! Thanks @williamma12!

Development

Successfully merging this pull request may close these issues.

Unable to read entire parquet directory

3 participants