-
Notifications
You must be signed in to change notification settings - Fork 4k
Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
While fuzzing the GDAL Parquet reader with a local run of ossfuzz, I got the following crash in TableBatchReader::ReadNext() on this attached fuzzed parquet file (to be unzipped first) : crash-581a7ec06da982291398aa3f63361ecb69fe20fc.zip
Details
==14==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000063500 at pc 0x000002c7899c bp 0x7ffdf52157f0 sp 0x7ffdf52157e8
READ of size 8 at 0x602000063500 thread T0
SCARINESS: 23 (8-byte-read-heap-buffer-overflow)
#0 0x2c7899b in get /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:801:16
#1 0x2c7899b in arrow::TableBatchReader::ReadNext(std::__1::shared_ptr*) /src/gdal/arrow/cpp/src/arrow/table.cc:652:60
#2 0x36ca184 in arrow::RecordBatchReader::Next() /src/gdal/arrow/cpp/src/arrow/record_batch.h:302:5
#3 0x78d182e in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1051:58
#4 0x78d182e in Next /src/gdal/arrow/cpp/src/arrow/util/iterator.h:346:29
#5 0x78d182e in arrow::Result > arrow::Iterator >::Next > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(), std::__1::shared_ptr > >(void*) /src/gdal/arrow/cpp/src/arrow/util/iterator.h:200:40
#6 0x78d1fd9 in Next /src/gdal/arrow/cpp/src/arrow/util/iterator.h:110:29
#7 0x78d1fd9 in arrow::FlattenIterator >::Next() /src/gdal/arrow/cpp/src/arrow/util/iterator.h:541:5
#8 0x78d1e67 in arrow::Result > arrow::Iterator >::Next > >(void*) /src/gdal/arrow/cpp/src/arrow/util/iterator.h:200:40
#9 0x78cbe47 in Next /src/gdal/arrow/cpp/src/arrow/util/iterator.h:110:29
#10 0x78cbe47 in parquet::arrow::(anonymous namespace)::RowGroupRecordBatchReader::ReadNext(std::__1::shared_ptr*) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:392:21
#11 0xa1d0ec in OGRParquetLayer::ReadNextBatch() /src/gdal/ogr/ogrsf_frmts/parquet/ogrparquetlayer.cpp:1782:46
#12 0x9a7f7f in OGRArrowLayer::GetNextRawFeature() /src/gdal/ogr/ogrsf_frmts/parquet/../arrow_common/ograrrowlayer.hpp:3973:19
#13 0x97ffc7 in GetNextFeature /src/gdal/ogr/ogrsf_frmts/ogrsf_frmts.h:466:45
#14 0x97ffc7 in OGRArrowLayer::GetNextFeature() /src/gdal/ogr/ogrsf_frmts/parquet/../arrow_common/ogr_arrow.h:269:5
#15 0xb462d4 in OGR_L_GetNextFeature /src/gdal/ogr/ogrsf_frmts/generic/ogrlayer.cpp:650:63
#16 0x5fd3a0 in LLVMFuzzerTestOneInput /src/gdal/./fuzzers/ogr_fuzzer.cpp:155:30
#17 0x4ceb03 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#18 0x4ce2ea in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:514:3
#19 0x4cf9b9 in fuzzer::Fuzzer::MutateAndTestOne() /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:757:19
#20 0x4d0685 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:895:5
#21 0x4bf9ef in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:912:6
#22 0x4e9042 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#23 0x7f4f72347082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: eebe5d5f4b608b8a53ec446b63981bba373ca0ca)
#24 0x4b042d in _start (/out/parquet_fuzzer+0x4b042d)
DEDUP_TOKEN: get--arrow::TableBatchReader::ReadNext(std::__1::shared_ptr*)--arrow::RecordBatchReader::Next()
0x602000063500 is located 0 bytes to the right of 16-byte region [0x6020000634f0,0x602000063500)
allocated by thread T4 here:
#0 0x5faadd in operator new(unsigned long) /src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
#1 0x2c89571 in __libcpp_operator_new /usr/local/bin/../include/c++/v1/new:245:10
#2 0x2c89571 in __libcpp_allocate /usr/local/bin/../include/c++/v1/new:271:10
#3 0x2c89571 in allocate /usr/local/bin/../include/c++/v1/__memory/allocator.h:105:38
#4 0x2c89571 in allocate /usr/local/bin/../include/c++/v1/__memory/allocator_traits.h:262:20
#5 0x2c89571 in __vallocate /usr/local/bin/../include/c++/v1/vector:931:37
#6 0x2c89571 in vector /usr/local/bin/../include/c++/v1/vector:1236:9
#7 0x2c89571 in arrow::ChunkedArray::ChunkedArray(std::__1::shared_ptr) /src/gdal/arrow/cpp/src/arrow/chunked_array.h:80:22
#8 0x7991577 in __shared_ptr_emplace > /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:294:37
#9 0x7991577 in std::__1::shared_ptr std::__1::allocate_shared, std::__1::shared_ptr, void>(std::__1::allocator const&, std::__1::shared_ptr&&) /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:953:55
#10 0x7947a65 in make_shared, void> /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:962:12
#11 0x7947a65 in parquet::arrow::TransferColumnData(parquet::internal::RecordReader*, std::__1::shared_ptr const&, parquet::ColumnDescriptor const*, arrow::MemoryPool*, std::__1::shared_ptr*) /src/gdal/arrow/cpp/src/parquet/arrow/reader_internal.cc:873:12
#12 0x78ac9c3 in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:488:5
#13 0x78d0155 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
#14 0x78d0155 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
#15 0x78d0155 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
#16 0x78d0155 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
#17 0x78d0155 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
#18 0x78d0155 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
#19 0x78d0155 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
#20 0x66b01f5 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
#21 0x66b01f5 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
#22 0x66b01f5 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
#23 0x66b01f5 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
#24 0x66b01f5 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
#25 0x66b01f5 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
#26 0x7f4f72691608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)
The bug isn't specific of the GDAL integration and can be reproduced with this simple pyarrow.parquet based script:
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('crash-581a7ec06da982291398aa3f63361ecb69fe20fc')
[_ for _ in parquet_file.iter_batches()]Details
==4082742== Invalid read of size 8
==4082742== at 0x71357DB: arrow::TableBatchReader::ReadNext(std::shared_ptr*) (in /home/even/install-arrow/lib/libarrow.so.1500.0.0)
==4082742== by 0xD36D07A: arrow::Result > arrow::Iterator >::Next > const&, std::vector > const&, std::unique_ptr >*)::{lambda()#1}::operator()()::{lambda()#2}, std::shared_ptr > >(void*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0xD381717: arrow::FlattenIterator >::Next() (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0xD381921: arrow::Result > arrow::Iterator >::Next > >(void*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0xD36F8B1: parquet::arrow::(anonymous namespace)::RowGroupRecordBatchReader::ReadNext(std::shared_ptr*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0xD212735: __pyx_gb_7pyarrow_8_parquet_13ParquetReader_10generator(__pyx_CoroutineObject*, _ts*, _object*) (in /home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/_parquet.cpython-38-x86_64-linux-gnu.so)
==4082742== by 0x6564152: __Pyx_Coroutine_SendEx(__pyx_CoroutineObject*, _object*, int) [clone .isra.0] (in /home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/lib.cpython-38-x86_64-linux-gnu.so)
==4082742== by 0x547634: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==4082742== by 0x5D5845: _PyFunction_Vectorcall (in /usr/bin/python3.8)
==4082742== by 0x547264: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==4082742== by 0x545529: _PyEval_EvalCodeWithName (in /usr/bin/python3.8)
==4082742== by 0x684326: PyEval_EvalCode (in /usr/bin/python3.8)
==4082742== Address 0x128daad0 is 0 bytes after a block of size 16 alloc'd
==4082742== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4082742== by 0xD394AFE: parquet::arrow::TransferColumnData(parquet::internal::RecordReader*, std::shared_ptr const&, parquet::ColumnDescriptor const*, arrow::MemoryPool*, std::shared_ptr*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0xD36FA1A: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0xD36D189: arrow::internal::FnOnce::FnImpl, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::vector > const&, std::vector > const&, std::unique_ptr >*)::{lambda()#1}::operator()()::{lambda(int)#1}, int)> >::invoke() (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742== by 0x7280716: std::thread::_State_impl > >::_M_run() (in /home/even/install-arrow/lib/libarrow.so.1500.0.0)
==4082742== by 0x8735DF3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==4082742== by 0x4A78608: start_thread (pthread_create.c:477)
==4082742== by 0x499B352: clone (clone.S:95)
==4082742==
Reproducable with v15.0.0 and latest master at time of writing (16e20b7)
I'm not sure if it is related but trying to read with ParquetReader.read_all() triggers a clean exception:
>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('crash-581a7ec06da982291398aa3f63361ecb69fe20fc')
>>> parquet_file.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 624, in read
return self.reader.read_all(column_indices=column_indices,
File "pyarrow/_parquet.pyx", line 1675, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 18 named timestamp_us_no_tz expected length 5 but got length 2
and another one when using ParquetReader.scan_contents():
>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('crash-581a7ec06da982291398aa3f63361ecb69fe20fc')
>>> parquet_file.scan_contents()
Traceback (most recent call last):
File "test.py", line 4, in <module>
parquet_file.scan_contents()
File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 662, in scan_contents
return self.reader.scan_contents(column_indices,
File "pyarrow/_parquet.pyx", line 1702, in pyarrow._parquet.ParquetReader.scan_contents
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Parquet error: Total rows among columns do not match
Component(s)
C++, Parquet
mapleFU