
Conversation

@nevi-me (Collaborator) commented Oct 7, 2020

The default Parquet projection works at the leaf level, so the schema below resolves to 4 leaf columns:

a: Struct<b: String, c: Int32>
d: List<Float32>
e: Int64
----
leaf 1: a.b
leaf 2: a.c
leaf 3: d
leaf 4: e

By default, when selecting fields 1 and 3, we don't get the root fields a and e; we get the leaves a.b and d.
This is often undesirable for users who want to select a without having to know that it is spread across 2 leaf indices.

This adds the option to select fields by their root.
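As a rough usage sketch (not code from this PR's diff), the new flag can be exercised through the reader's `get_schema_by_columns`. The file name and index choices are made up, the index argument is assumed to accept any iterator of `usize` (the tests pass a range), and the reader construction may differ slightly between parquet crate versions:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file containing the schema from the description above.
    let file = File::open("data.parquet")?;
    // Older releases take an Rc<dyn FileReader> here instead of Arc.
    let file_reader = Arc::new(SerializedFileReader::new(file)?);
    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);

    // Leaf-level projection (leaf_columns = true): 0-based indices 0 and 2
    // address the leaves a.b and d out of [a.b, a.c, d, e].
    let leaf_schema = arrow_reader.get_schema_by_columns(vec![0, 2], true)?;
    println!("by leaf: {:?}", leaf_schema);

    // Root-level projection (leaf_columns = false): the same indices now
    // address the roots a and e, so `a` is selected as a whole struct.
    let root_schema = arrow_reader.get_schema_by_columns(vec![0, 2], false)?;
    println!("by root: {:?}", root_schema);

    Ok(())
}
```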

We also read all Arrow types in the roundtrip test. Lists and structs continue to fail, and have been commented out.

carols10cents and others added 8 commits October 5, 2020 14:59
Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

Tests that currently fail are either marked with `#[should_panic]` (if they
fail because of a panic) or `#[ignore]` (if they fail because the values
don't match).
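For readers unfamiliar with those attributes, this is roughly what the convention looks like in a test module (a generic illustration, not the tests added here; the helper names are hypothetical):

```rust
#[cfg(test)]
mod tests {
    // The code under test panics today, so the panic is expected
    // until the underlying issue is fixed.
    #[test]
    #[should_panic]
    fn roundtrip_that_panics() {
        unimplemented!("not supported yet");
    }

    // Runs to completion but the values don't match yet, so it is skipped
    // unless explicitly requested with `cargo test -- --ignored`.
    #[test]
    #[ignore]
    fn roundtrip_with_mismatched_values() {
        assert_eq!(round_trip(), expected());
    }

    // Hypothetical helpers standing in for the real roundtrip machinery.
    fn round_trip() -> i32 { 0 }
    fn expected() -> i32 { 1 }
}
```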
Previously, if an Arrow schema was present in the Parquet metadata, that
schema would always be returned when requesting all columns via
`parquet_to_arrow_schema` and would never be returned when requesting a
subset of columns via `parquet_to_arrow_schema_by_columns`.

Now, if a valid Arrow schema is present in the Parquet metadata and a
subset of columns is requested by Parquet column index, the
`parquet_to_arrow_schema_by_columns` function first looks for a field of
the same name in the Arrow schema, and falls back to the Parquet schema
for any column that has no matching Arrow Field.

This is part of what is needed to be able to restore Arrow types like
LargeUtf8 from Parquet.
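A condensed sketch of that lookup order (illustrative only; the real function works over the embedded schema and the converted Parquet fields, and the names here are made up):

```rust
use arrow::datatypes::{Field, Schema};

/// Illustrative only: prefer the Arrow field with a matching name from the
/// embedded Arrow schema, otherwise keep the field converted from the
/// Parquet schema.
fn resolve_field(embedded_arrow_schema: Option<&Schema>, parquet_field: Field) -> Field {
    embedded_arrow_schema
        .and_then(|schema| {
            schema
                .fields()
                .iter()
                .find(|f| f.name() == parquet_field.name())
                .cloned()
        })
        .unwrap_or(parquet_field)
}
```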
@nevi-me nevi-me requested a review from carols10cents October 7, 2020 18:17
fn get_schema_by_columns<T>(
    &mut self,
    column_indices: T,
    leaf_columns: bool,
@nevi-me (Collaborator, Author):
added this extra option

self.ptr
}

#[allow(clippy::wrong_self_convention)]
@nevi-me (Collaborator, Author):

an annoying clippy lint, disabled it
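For context, a generic example of what that lint complains about (not the actual code in this diff): clippy expects `into_*` methods to take `self` by value, which accessor-style methods sometimes deliberately don't do.

```rust
struct Buffer {
    ptr: *const u8,
}

impl Buffer {
    // clippy::wrong_self_convention fires because an `into_*` method is
    // expected to consume `self`; the lint is silenced so the accessor can
    // keep borrowing instead.
    #[allow(clippy::wrong_self_convention)]
    fn into_raw(&self) -> *const u8 {
        self.ptr
    }
}

fn main() {
    let buf = Buffer { ptr: std::ptr::null() };
    let _raw = buf.into_raw();
}
```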

.unwrap_or_default();

// add the Arrow metadata to the Parquet metadata
if let Some(arrow_schema) = &arrow_schema_metadata {
@nevi-me (Collaborator, Author):

We weren't preserving metadata before. Arrow keys now overwrite existing Parquet keys in the metadata.
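A small sketch of that precedence rule (illustrative names, not the actual writer code): extending the Parquet key/value metadata with the Arrow entries makes the Arrow values win on key collisions.

```rust
use std::collections::HashMap;

/// Illustrative only: merge schema metadata so that Arrow-provided keys
/// overwrite Parquet keys with the same name while other keys are preserved.
fn merge_metadata(
    parquet_metadata: HashMap<String, String>,
    arrow_metadata: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = parquet_metadata;
    // `extend` inserts or replaces, so Arrow entries take precedence.
    merged.extend(arrow_metadata.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}

fn main() {
    let mut parquet = HashMap::new();
    parquet.insert("writer".to_string(), "parquet-rs".to_string());
    parquet.insert("ARROW:schema".to_string(), "stale".to_string());

    let mut arrow = HashMap::new();
    arrow.insert("ARROW:schema".to_string(), "fresh".to_string());

    let merged = merge_metadata(parquet, &arrow);
    assert_eq!(merged["ARROW:schema"], "fresh"); // Arrow key overwrote the Parquet key
    assert_eq!(merged["writer"], "parquet-rs"); // unrelated key preserved
}
```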

))))),
true,
),
// Field::new(
@nevi-me (Collaborator, Author):

These are still failing roundtrip, so I've disabled them. I'm planning on opening a JIRA to address this.


// read all fields by columns
let partial_read_schema =
    arrow_reader.get_schema_by_columns(0..(schema.fields().len()), false)?;
@nevi-me (Collaborator, Author):

@carols10cents I changed this to read all columns by root.

@carols10cents (Member):

Nice! I hadn't yet figured out why there was a different number of Parquet columns and Arrow fields; I think I'm starting to understand now :)

@carols10cents (Member) commented:

I cherry-picked onto schema-roundtrip because I force pushed that 🤭

carols10cents pushed a commit that referenced this pull request Apr 15, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is called recursively, and the mutex is non-reentrant, hence the deadlock.

To fix it, I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the number and status of in-flight requests.
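For intuition, here is a minimal Rust analogue of the failure mode (not the actual C++ code): a callback re-enters a method while a non-reentrant mutex is still held, so the nested lock can never be acquired.

```rust
use std::sync::Mutex;

struct Walker {
    // Non-reentrant lock guarding the walker's bookkeeping.
    in_flight: Mutex<usize>,
}

impl Walker {
    // Stand-in for SpawnListObjectsV2: it takes the lock, then invokes the
    // "callback" synchronously, which calls back into this same method.
    fn spawn(&self, depth: usize) {
        let guard = self.in_flight.lock().unwrap(); // first acquisition
        if depth > 0 {
            // The nested lock() inside this call blocks on `guard` (or
            // panics, depending on the platform), because std::sync::Mutex
            // is not reentrant.
            self.spawn(depth - 1);
        }
        drop(guard);
    }
}

fn main() {
    let walker = Walker { in_flight: Mutex::new(0) };
    walker.spawn(1); // deadlocks on the recursive call
}
```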

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
