add option to project root columns from schema #2
Conversation
Inspired by tests in `cpp/src/parquet/arrow/arrow_reader_writer_test.cc`. Tests that currently fail are marked with either `#[should_panic]` (if they fail because of a panic) or `#[ignore]` (if they fail because the values don't match).
Previously, if an Arrow schema was present in the Parquet metadata, that schema would always be returned when requesting all columns via `parquet_to_arrow_schema` and would never be returned when requesting a subset of columns via `parquet_to_arrow_schema_by_columns`. Now, if a valid Arrow schema is present in the Parquet metadata and a subset of columns is requested by Parquet column index, the `parquet_to_arrow_schema_by_columns` function will try to find a column of the same name in the Arrow schema first, and then fall back to the Parquet schema for that column if there isn't an Arrow Field for that column. This is part of what is needed to be able to restore Arrow types like LargeUtf8 from Parquet.
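The fallback described above can be sketched as follows. This is a simplified illustration, not the crate's actual implementation: the `Field` struct and `schema_by_columns` function here are hypothetical stand-ins for the real Arrow/Parquet types, using a plain type-name string in place of a `DataType`.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for an Arrow/Parquet field: a name
// plus a type name (e.g. "LargeUtf8" vs "Utf8").
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// Sketch of the lookup order: for each requested Parquet column, prefer
// the same-named field from the embedded Arrow schema, and fall back to
// the Parquet-derived field when no Arrow field exists for that name.
fn schema_by_columns(
    arrow_fields: &HashMap<String, Field>,
    parquet_columns: &[Field],
    indices: &[usize],
) -> Vec<Field> {
    indices
        .iter()
        .map(|&i| {
            let pq = &parquet_columns[i];
            arrow_fields
                .get(&pq.name)
                .cloned()
                .unwrap_or_else(|| pq.clone())
        })
        .collect()
}

fn main() {
    // Parquet only knows Utf8; the embedded Arrow schema says LargeUtf8.
    let parquet = vec![
        Field { name: "s".into(), data_type: "Utf8".into() },
        Field { name: "n".into(), data_type: "Int64".into() },
    ];
    let mut arrow = HashMap::new();
    arrow.insert(
        "s".to_string(),
        Field { name: "s".into(), data_type: "LargeUtf8".into() },
    );

    let projected = schema_by_columns(&arrow, &parquet, &[0, 1]);
    // "s" is restored from the Arrow schema; "n" falls back to Parquet.
    assert_eq!(projected[0].data_type, "LargeUtf8");
    assert_eq!(projected[1].data_type, "Int64");
}
```

This name-based lookup is what lets Arrow-only types such as `LargeUtf8` survive a roundtrip through Parquet, which by itself can only describe the column as plain UTF-8.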
```rust
    fn get_schema_by_columns<T>(
        &mut self,
        column_indices: T,
        leaf_columns: bool,
```
added this extra option
```rust
        self.ptr
    }

    #[allow(clippy::wrong_self_convention)]
```
An annoying clippy lint; disabled it.
```rust
            .unwrap_or_default();

        // add the Arrow metadata to the Parquet metadata
        if let Some(arrow_schema) = &arrow_schema_metadata {
```
We weren't preserving metadata before; Arrow keys will now overwrite existing Parquet keys in the metadata.
```rust
                ))))),
                true,
            ),
            // Field::new(
```
These are still failing the roundtrip, so I've disabled them. I'm planning on opening a JIRA to address this.
```rust
        // read all fields by columns
        let partial_read_schema =
            arrow_reader.get_schema_by_columns(0..(schema.fields().len()), false)?;
```
@carols10cents I changed this to read all columns by root.
Nice! I hadn't yet figured out why there were a different number of Parquet columns and Arrow fields; I think I'm starting to understand now :)
Force-pushed from f65b2ba to 30e3e41
I cherry-picked onto schema-roundtrip because I force pushed that 🤭
From a deadlocked run:

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is being called recursively and the mutex is non-reentrant, thus the deadlock. To fix it I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number/status of in-flight requests.

Closes #9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
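The failure mode in that backtrace can be reproduced in miniature. This is a hedged illustration in Rust (the Arrow code above is C++), showing only why a non-reentrant mutex deadlocks when a callback re-enters the code path that already holds the lock; it is not the `TreeWalker` code itself.

```rust
use std::sync::Mutex;

fn main() {
    // Stand-in for the walker's shared state, guarded by a
    // non-reentrant mutex (std::sync::Mutex, like pthread_mutex here).
    let state = Mutex::new(0u32);

    // Outer acquisition: the walker holds the lock while it spawns a
    // request and, in the deadlocked run, the completion callback fires
    // on the same thread before the guard is released.
    let _guard = state.lock().unwrap();

    // A recursive callback calling `lock()` at this point would block
    // forever. `try_lock` demonstrates that the lock is unavailable even
    // to the thread that already owns it:
    assert!(state.try_lock().is_err());
}
```

Replacing the mutex with a task group sidesteps this: completion bookkeeping no longer requires re-acquiring a lock inside the callback.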
The default Parquet projection works at the leaf level, such that the schema below would have 4 fields:

By default, when selecting fields 1 and 3, we don't get `a` and `e`, but we get `a.b` and `d`. This is often undesirable for users who might want to select `a` without knowing that it's spread out into 2 leaf indices. This adds the option to select fields by their root.
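The distinction can be sketched with plain strings. This is a simplified model, not the crate's code: leaf paths like `"a.b"` stand in for flattened Parquet leaf columns, and the leaf/root lists are hypothetical examples.

```rust
// Root of a dotted leaf path, e.g. "a.b" -> "a".
fn root_of(path: &str) -> &str {
    path.split('.').next().unwrap()
}

// Leaf projection (the existing behaviour): an index addresses a leaf
// column directly.
fn project_leaves<'a>(leaves: &'a [&'a str], indices: &[usize]) -> Vec<&'a str> {
    indices.iter().map(|&i| leaves[i]).collect()
}

// Root projection (the new option): an index addresses a root field, and
// every leaf that flattens out of that root is selected.
fn project_roots<'a>(
    leaves: &'a [&'a str],
    roots: &'a [&'a str],
    indices: &[usize],
) -> Vec<&'a str> {
    indices
        .iter()
        .flat_map(move |&i| {
            let r = roots[i];
            leaves.iter().copied().filter(move |leaf| root_of(leaf) == r)
        })
        .collect()
}

fn main() {
    // Hypothetical schema: struct `a` flattens into leaves a.b and a.c,
    // plus two primitive fields d and e -> 4 leaf columns, 3 roots.
    let leaves = ["a.b", "a.c", "d", "e"];
    let roots = ["a", "d", "e"];

    // Leaf indices pick individual flattened columns...
    assert_eq!(project_leaves(&leaves, &[0, 2]), vec!["a.b", "d"]);
    // ...while a root index picks everything under that field.
    assert_eq!(project_roots(&leaves, &roots, &[0]), vec!["a.b", "a.c"]);
}
```

With root projection, a caller who asks for `a` gets the whole struct back without needing to know how many leaf columns it occupies.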
We also read all Arrow types in the roundtrip test. Lists and structs continue to fail, and have been commented out.