Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
While fuzzing the GDAL Parquet reader with a local OSS-Fuzz run, I got the following crash in ByteArrayChunkedRecordReader::ReadValuesSpaced() on the attached fuzzed Parquet file (unzip it first): crash-34fd88d625cc5fef893bcba62aad402883d98f47.zip
Details
==14==ERROR: AddressSanitizer: heap-use-after-free on address 0x60f000046e58 at pc 0x000007a43e97 bp 0x7f926c00d7e0 sp 0x7f926c00d7d8
READ of size 8 at 0x60f000046e58 thread T6
SCARINESS: 51 (8-byte-read-heap-use-after-free)
#0 0x7a43e96 in parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc:2135:51
#1 0x7a3d3e4 in ReadSpacedForOptionalOrRepeated /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1914:5
#2 0x7a3d3e4 in ReadOptionalRecords /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1870:7
#3 0x7a3d3e4 in parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1940:22
#4 0x7a1fce0 in parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc
#5 0x78acf06 in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:482:46
#6 0x78d07a5 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
#7 0x78d07a5 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
#8 0x78d07a5 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
#9 0x78d07a5 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
#10 0x78d07a5 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
#11 0x78d07a5 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
#12 0x78d07a5 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
#13 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
#14 0x66b0845 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
#15 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
#16 0x66b0845 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
#17 0x66b0845 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
#18 0x66b0845 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
#19 0x7f9272659608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)
#20 0x7f927240a352 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f352) (BuildId: eebe5d5f4b608b8a53ec446b63981bba373ca0ca)
DEDUP_TOKEN: parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long)--ReadSpacedForOptionalOrRepeated--ReadOptionalRecords
0x60f000046e58 is located 168 bytes inside of 176-byte region [0x60f000046db0,0x60f000046e60)
freed by thread T6 here:
#0 0x5fb33d in operator delete(void*) /src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:152:3
#1 0x7a245a9 in operator() /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:53:5
#2 0x7a245a9 in reset /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:314:7
#3 0x7a245a9 in ~unique_ptr /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:268:19
#4 0x7a245a9 in ~pair /usr/local/bin/../include/c++/v1/__utility/pair.h:40:29
#5 0x7a245a9 in destroy >, std::__1::default_delete > > > >, void, void> /usr/local/bin/../include/c++/v1/__memory/allocator_traits.h:319:15
#6 0x7a245a9 in __deallocate_node /usr/local/bin/../include/c++/v1/__hash_table:1572:9
#7 0x7a245a9 in clear /usr/local/bin/../include/c++/v1/__hash_table:1818:9
#8 0x7a245a9 in clear /usr/local/bin/../include/c++/v1/unordered_map:1346:42
#9 0x7a245a9 in ResetDecoders /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1810:42
#10 0x7a245a9 in SetPageReader /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1802:5
#11 0x7a245a9 in virtual thunk to parquet::internal::(anonymous namespace)::TypedRecordReader >::SetPageReader(std::__1::unique_ptr >) /src/gdal/arrow/cpp/src/parquet/column_reader.cc
#12 0x78abf1d in parquet::arrow::(anonymous namespace)::LeafReader::NextRowGroup() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:506:21
#13 0x78acf3e in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:485:9
#14 0x78d07a5 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
#15 0x78d07a5 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
#16 0x78d07a5 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
#17 0x78d07a5 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
#18 0x78d07a5 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
#19 0x78d07a5 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
#20 0x78d07a5 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
#21 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
#22 0x66b0845 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
#23 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
#24 0x66b0845 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
#25 0x66b0845 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
#26 0x66b0845 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
#27 0x7f9272659608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)
The bug isn't specific to the GDAL integration and can be reproduced with this simple pyarrow.parquet-based script:
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('crash-34fd88d625cc5fef893bcba62aad402883d98f47')
parquet_file.read()
Details
==4171200== Invalid read of size 8
==4171200== at 0xFBFDE22: parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long) (column_reader.cc:2180)
==4171200== by 0xFC472DE: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadSpacedForOptionalOrRepeated(long, long*, long*) (column_reader.cc:1957)
==4171200== by 0xFC3A479: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadOptionalRecords(long, long*, long*) (column_reader.cc:1910)
==4171200== by 0xFC31E31: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long) (column_reader.cc:1983)
==4171200== by 0xFC2334E: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) (column_reader.cc:1453)
==4171200== by 0xFB0D0D1: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:482)
==4171200== by 0xFB23E38: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr*) (reader.cc:109)
==4171200== by 0xFB0B951: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector > const&, parquet::arrow::ColumnReader*, std::shared_ptr*) (reader.cc:284)
==4171200== by 0xFB12A48: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}::operator()(unsigned long, std::shared_ptr) const (reader.cc:1253)
==4171200== by 0xFB216A8: std::enable_if<((!std::is_void > >::value)&&(!arrow::detail::is_future::value))&&((!arrow::Future::is_empty)||std::is_same::value), void>::type arrow::detail::ContinueFuture::operator(), std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&, arrow::Result >, arrow::Future >(arrow::detail::is_future, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&) const (future.h:150)
==4171200== by 0xFB21166: void std::__invoke_impl >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&) (invoke.h:60)
==4171200== by 0xFB20A0A: std::__invoke_result >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>::type std::__invoke >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>(std::__invoke_result&&, (arrow::detail::ContinueFuture&)...) (invoke.h:95)
==4171200== Address 0x2b08fff8 is 168 bytes inside a block of size 176 free'd
==4171200== at 0x483D1CF: operator delete(void*, unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4171200== by 0xFD047C2: parquet::(anonymous namespace)::DictByteArrayDecoderImpl::~DictByteArrayDecoderImpl() (encoding.cc:1887)
==4171200== by 0xFC59125: std::default_delete > >::operator()(parquet::TypedDecoder >*) const (unique_ptr.h:81)
==4171200== by 0xFC58709: std::unique_ptr >, std::default_delete > > >::~unique_ptr() (unique_ptr.h:292)
==4171200== by 0xFC57B73: std::pair >, std::default_delete > > > >::~pair() (stl_pair.h:208)
==4171200== by 0xFC57B97: void __gnu_cxx::new_allocator >, std::default_delete > > > >, false> >::destroy >, std::default_delete > > > > >(std::pair >, std::default_delete > > > >*) (new_allocator.h:152)
==4171200== by 0xFC56896: void std::allocator_traits >, std::default_delete > > > >, false> > >::destroy >, std::default_delete > > > > >(std::allocator >, std::default_delete > > > >, false> >&, std::pair >, std::default_delete > > > >*) (alloc_traits.h:496)
==4171200== by 0xFC55418: std::__detail::_Hashtable_alloc >, std::default_delete > > > >, false> > >::_M_deallocate_node(std::__detail::_Hash_node >, std::default_delete > > > >, false>*) (hashtable_policy.h:2102)
==4171200== by 0xFC53C9D: std::__detail::_Hashtable_alloc >, std::default_delete > > > >, false> > >::_M_deallocate_nodes(std::__detail::_Hash_node >, std::default_delete > > > >, false>*) (hashtable_policy.h:2124)
==4171200== by 0xFC51DA1: std::_Hashtable >, std::default_delete > > > >, std::allocator >, std::default_delete > > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::clear() (hashtable.h:2063)
==4171200== by 0xFC621AF: std::unordered_map >, std::default_delete > > >, std::hash, std::equal_to, std::allocator >, std::default_delete > > > > > >::clear() (unordered_map.h:844)
==4171200== by 0xFC334DD: parquet::internal::(anonymous namespace)::TypedRecordReader >::ResetDecoders() (column_reader.cc:1850)
==4171200== Block was alloc'd at
==4171200== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4171200== by 0xFCF657B: std::_MakeUniq::__single_object std::make_unique(parquet::ColumnDescriptor const*&, arrow::MemoryPool*&) (unique_ptr.h:857)
==4171200== by 0xFCE9BF8: parquet::detail::MakeDictDecoder(parquet::Type::type, parquet::ColumnDescriptor const*, arrow::MemoryPool*) (encoding.cc:3865)
==4171200== by 0xFC64D07: std::unique_ptr >, std::default_delete > > > parquet::MakeDictDecoder >(parquet::ColumnDescriptor const*, arrow::MemoryPool*) (encoding.h:456)
==4171200== by 0xFC46045: parquet::(anonymous namespace)::ColumnReaderImplBase >::ConfigureDictionary(parquet::DictionaryPage const*) (column_reader.cc:772)
==4171200== by 0xFC39F2F: parquet::(anonymous namespace)::ColumnReaderImplBase >::ReadNewPage() (column_reader.cc:727)
==4171200== by 0xFC316F6: parquet::(anonymous namespace)::ColumnReaderImplBase >::HasNextInternal() (column_reader.cc:700)
==4171200== by 0xFC230ED: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) (column_reader.cc:1409)
==4171200== by 0xFB0D0D1: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:482)
==4171200== by 0xFB23E38: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr*) (reader.cc:109)
==4171200== by 0xFB0B951: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector > const&, parquet::arrow::ColumnReader*, std::shared_ptr*) (reader.cc:284)
==4171200== by 0xFB12A48: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}::operator()(unsigned long, std::shared_ptr) const (reader.cc:1253)
==4171200==
pure virtual method called
terminate called without an active exception
ParquetReader.scan_contents() does detect an error on this file, so a validation is likely missing in the code path followed by DecodeRowGroups() (the fix I propose in #41320 (comment) doesn't help):
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    parquet_file.scan_contents()
  File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 662, in scan_contents
    return self.reader.scan_contents(column_indices,
  File "pyarrow/_parquet.pyx", line 1702, in pyarrow._parquet.ParquetReader.scan_contents
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Invalid or corrupted bit_width 254. Maximum allowed is 32.
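Since scan_contents() raises a regular OSError on this file while read() crashes, a caller could, as a stopgap, scan before reading. The following is only a minimal sketch of that idea and not a fix: the helper name read_after_scan is hypothetical, it assumes an object exposing scan_contents() and read() as pyarrow.parquet.ParquetFile does, and a successful scan is not guaranteed to catch every corrupted file. It also decodes the file twice.

```python
def read_after_scan(parquet_file):
    """Call scan_contents() first and only call read() if the scan succeeds.

    `parquet_file` is assumed to behave like pyarrow.parquet.ParquetFile.
    Returns (table, None) on success or (None, error_message) if the scan
    reports corruption through OSError.
    """
    try:
        # On the fuzzed file from this report, this raises
        # "OSError: Invalid or corrupted bit_width 254. Maximum allowed is 32."
        parquet_file.scan_contents()
    except OSError as exc:
        return None, str(exc)
    # The scan found nothing wrong, so attempt the full read.
    return parquet_file.read(), None


# Hypothetical usage with pyarrow installed:
#   import pyarrow.parquet as pq
#   table, error = read_after_scan(pq.ParquetFile("some_file.parquet"))
```

With the fuzzed file from this report, such a wrapper would be expected to return the bit_width error message instead of reaching the crashing read() path, assuming scan_contents() behaves as in the traceback above.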
Component(s)
C++, Parquet