pitrou (Member) commented Jul 22, 2019

These are like Binary and String respectively, except with 64-bit offsets so as to allow extremely large individual values.

Subprojects (Gandiva, Parquet) as well as other bindings and implementations (e.g. Python) will have to be updated separately.
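
For readers unfamiliar with the layout, here is a minimal sketch in plain Python (not the Arrow C++ implementation) of why the offset width matters. A variable-length binary column stores one data buffer plus an offsets buffer with one entry per value boundary. The helper `build_offsets` and its `offset_bits` parameter are hypothetical names used only for illustration. With signed 32-bit offsets the total data size is capped at 2^31 - 1 bytes; with 64-bit offsets the cap is effectively removed.

```python
import struct

def build_offsets(lengths, offset_bits=32):
    """Return packed little-endian offsets for values of the given byte lengths.

    Hypothetical helper for illustration only; not part of any Arrow API.
    """
    fmt = {32: "i", 64: "q"}[offset_bits]      # signed 32- or 64-bit integer
    max_offset = 2 ** (offset_bits - 1) - 1
    offsets = [0]
    for n in lengths:
        end = offsets[-1] + n
        if end > max_offset:
            raise OverflowError(
                f"offset {end} does not fit in {offset_bits} bits")
        offsets.append(end)
    return struct.pack(f"<{len(offsets)}{fmt}", *offsets)
```

A single value of 2**31 bytes makes the 32-bit variant raise `OverflowError`, while the 64-bit variant accepts it at the cost of twice the offsets storage.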

Marius Seritan and others added 30 commits July 13, 2019 12:01
Dependent crates may not want the rustyline dependency, especially since
the nightly support seems to be custom. Introduce a "cli" feature to
allow consumers to not bring in the cli dependencies.

Author: Marius Seritan <github@winding-lines.com>

Closes apache#4742 from winding-lines/master and squashes the following commits:

2331587 <Marius Seritan> Make the datafusion cli optional Dependent crates may not want the rustyline dependency, specially since the nightly support seems to be custom. Introduce a "cli" feature to allow consumers to not bring in the cli depedencies.
Because we don't publish modules to crates.io yet.

Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4747 from kou/release-verify-rust and squashes the following commits:

c424af6 <Sutou Kouhei>  Use local modules to verify RC
…lease/03-binary.sh

Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4753 from kou/release-upload-binary-avoid-known-host-duplication and squashes the following commits:

7fc0018 <Sutou Kouhei>  Avoid duplicated known host SSH error in dev/release/03-binary.sh
Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4754 from kou/release-skip-uploaded-binary and squashes the following commits:

e8cd528 <Sutou Kouhei>  Skip already uploaded file
Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4755 from kou/release-add-missing-wait and squashes the following commits:

081c1aa <Sutou Kouhei>  Add missing waits on uploading binaries
libplasma-glib-doc and libgandiva-glib-doc are unexpectedly
unavailable. This should be fixed in the next release.

Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4756 from kou/release-verify-apt-update-packages and squashes the following commits:

d95f70d <Sutou Kouhei>  Update expected package list
Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4758 from kou/release-verify-apt-update-distributions and squashes the following commits:

6a2af8d <Sutou Kouhei>  Update supported distributions
C++ and Python Tests pass locally so this seems to be ok for us.

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#4752 from xhochy/ARROW-5609 and squashes the following commits:

6e087d6 <Uwe L. Korn> ARROW-5609:  Set CMP0068 CMake policy to avoid macOS warnings
But really 32768 columns should be enough for anyone :)

Author: Micah Kornfield <emkornfield@gmail.com>

Closes apache#4762 from emkornfield/csv and squashes the following commits:

ab0504c <Micah Kornfield> lower number of columns in test to satisfy ming
8f53a8a <Micah Kornfield> remove test
acfe2d8 <Micah Kornfield> remove cap, make min rows_in_chunk 512
08ddc22 <Micah Kornfield> remove floor duplication
211472a <Micah Kornfield> powers of 2 are better
b91a9e1 <Micah Kornfield> ARROW-5791:  Fix infinite loop with more the 32768 columns.  Cap max columns
External shell scripts may refer to unbound variables:

    /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
    line 55: PS1: unbound variable

Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4773 from kou/release-verify-without-u and squashes the following commits:

e272703 <Sutou Kouhei>  Remove undefined variable check from verify script
…didate.sh

This is a temporary fix

Author: Wes McKinney <wesm+git@apache.org>

Closes apache#4768 from wesm/rc-no-curl-download-background and squashes the following commits:

171f92e <Wes McKinney> Do not curl in background
Author: Antoine Pitrou <antoine@python.org>

Closes apache#4767 from pitrou/ARROW-5564-conda-uriparser and squashes the following commits:

3c422c9 <Antoine Pitrou> ARROW-5564:  Use uriparser from conda-forge
- Add utility methods for unaligned loads and use them where errors
  are discovered.
- Upgrade the version of flatbuffers to avoid issues with unaligned
  loads in that library.
- Discovered a bug in the spec that makes zero-copy with well-defined behavior
  virtually impossible with flatbuffers (need to discuss on ML).  For now I'm
  not turning on ASAN and will file a follow-up JIRA to track this.

Still needed:
 - [ ] Performance testing
 - [X] Discuss flatbuffers issues (I sent e-mail to LM)

Author: Micah Kornfield <emkornfield@gmail.com>
Author: emkornfield <emkornfield@gmail.com>

Closes apache#4757 from emkornfield/ubsan_mem and squashes the following commits:

5528584 <emkornfield> remove TODO
db49fbb <Micah Kornfield> Ubsan excluding flatbuffers
…h too long

Domain sockets have a platform-dependent path length limit. The release verification script on OSX tends to set a temporary directory that makes the test exceed this. Rather than hardcoding `/tmp` or some other directory, we skip the test instead.
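
As a rough illustration of the skip condition (a sketch, not the code in this patch), the check can be expressed as below. The function names are hypothetical, and the limits are assumed typical sizes of `sockaddr_un.sun_path` (104 bytes on macOS/BSD, 108 on Linux, both including the trailing NUL); they are not taken from the patch.

```python
import sys

# Assumed limits for sockaddr_un.sun_path, including the trailing NUL byte:
# 104 bytes on macOS/BSD, 108 bytes on Linux. Typical values, not taken
# from the patch itself.
def max_socket_path_length(platform=sys.platform):
    return 104 if platform == "darwin" or "bsd" in platform else 108

def socket_path_too_long(path, platform=sys.platform):
    """Return True if `path` cannot fit in a Unix domain socket address."""
    encoded = path.encode("utf-8")
    return len(encoded) + 1 > max_socket_path_length(platform)
```

A test could call `socket_path_too_long` on its temporary socket path and skip itself when the result is true, instead of relying on a hardcoded directory.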

Author: David Li <li.davidm96@gmail.com>

Closes apache#4793 from lihalite/arrow-5836 and squashes the following commits:

67eb3b7 <David Li> Skip Flight domain socket test when path too long
This patch bumps the version and makes sure that the release scripts bump it next time. After some reflection, the other idea of removing `pkgver` from the PKGBUILD probably wouldn't work, given the order of the build steps in makepkg.

Author: Neal Richardson <neal.p.richardson@gmail.com>

Closes apache#4805 from nealrichardson/r-pkgbuild-version and squashes the following commits:

1a84d25 <Neal Richardson> Bump version in ci/PKGBUILD in release
Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4795 from kou/cpp-macos-grpc-openssl and squashes the following commits:

6014d45 <Sutou Kouhei>  Delegate OPENSSL_ROOT_DIR to bundled gRPC
This makes Protobuf_SOURCE=AUTO work well in environments that have an old
version of Protocol Buffers.

Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4785 from kou/cpp-protobuf-version-check and squashes the following commits:

b4e9a88 <Sutou Kouhei>  Add required Protocol Buffers versions check
Author: Antoine Pitrou <antoine@python.org>

Closes apache#4791 from pitrou/ARROW-5775-boxed-fields and squashes the following commits:

9c2a333 <Antoine Pitrou> Add "inline"
6fe0b86 <Antoine Pitrou> ARROW-5775:  Fix thread-unsafe cached data
…gen.sh in dev/release/02-source.sh

The c_glib/ source archive is generated by `make dist` because it includes a configure script.
The current `dev/release/02-source.sh` builds Arrow C++ and Arrow GLib to include the GTK-Doc artifacts and then runs `make dist`, but this is slow.
So this PR runs only `c_glib/autogen.sh` and then replaces c_glib/.

Author: Yosuke Shiro <yosuke.shiro615@gmail.com>
Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4749 from shiro615/release-replace-c-glib-after-running-autogen and squashes the following commits:

9a69f8e <Yosuke Shiro> Remove an unnecessary environment variable
3a2550f <Yosuke Shiro> Remove omit from 02-source-test.rb
501a2dd <Yosuke Shiro> Remove autom4te.cache after running autogen.sh
46a4f89 <Sutou Kouhei> Use docker-compose
e357a88 <Yosuke Shiro> Exclude c_glib/autom4te.cache/* from RAT check
70cb4a7 <Yosuke Shiro> Remove an unnecessary diff
aa78680 <Yosuke Shiro> Enable test test_glib_configure on Travis CI
e04276e <Yosuke Shiro> Remove libraries for C++ build
56098ae <Yosuke Shiro>  Replace c_glib/ by c_glib/ after running autogen.sh
Also fix a warning because of static variables in headers.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#4808 from pitrou/ARROW-5851-compile-reference-benchmarks and squashes the following commits:

fb6740a <Antoine Pitrou> ARROW-5851:  Fix compilation of reference benchmarks
Author: Antoine Pitrou <antoine@python.org>

Closes apache#4804 from pitrou/ARROW-5849-mingw-warnings and squashes the following commits:

ff48a18 <Antoine Pitrou> ARROW-5849:  Fix compiler warnings on mingw32
Author: Yosuke Shiro <yosuke.shiro615@gmail.com>

Closes apache#4816 from shiro615/cpp-remove-duplicate-library and squashes the following commits:

1a200d4 <Yosuke Shiro>  Remove duplicate library in cpp/Brewfile
Because gRPC requires c-ares CMake config.

See also: https://lists.apache.org/thread.html/babb7985a8206807dd8893a2c7affdb733f3d561ecfcc7f26ba660d9@%3Cdev.arrow.apache.org%3E

Author: Sutou Kouhei <kou@clear-code.com>

Closes apache#4783 from kou/cpp-c-ares-require-cmake-config and squashes the following commits:

7fe2784 <Sutou Kouhei>  Require c-ares CMake config
… OpenSSL

The wheel build and test build succeed; only the deployment fails because openssl was removed for testing: https://travis-ci.org/ursa-labs/crossbow/builds/555822505

The currently running build should pass: https://travis-ci.org/ursa-labs/crossbow/builds/555849713
although we have a bunch of warnings `<lib> was built for newer OSX version (10.12) than being linked (10.9)` for the dependencies coming from brew.

cc @xhochy

Author: Krisztián Szűcs <szucs.krisztian@gmail.com>

Closes apache#4823 from kszucs/osx-wheel-openssl and squashes the following commits:

eb45ae4 <Krisztián Szűcs> reinstall openssl
ca8d177 <Krisztián Szűcs> ignore deps
be4ab0f <Krisztián Szűcs> openssl
…n to avoid segfault

As reported on JIRA, the following script provokes a segfault

```
#! /usr/bin/env python

import pyarrow
import sys
del sys.modules['pyarrow.lib']
```

For some reason this does not trigger the destruction of the private `_ExtensionTypesInitializer` object. Not sure why (Antoine may know). Using the atexit module instead seems to do the trick

Author: Wes McKinney <wesm+git@apache.org>

Closes apache#4824 from wesm/ARROW-5863 and squashes the following commits:

ba60578 <Wes McKinney> Use atexit module for extension type finalization
…nylinux2010 image so lz4 is statically linked

I pushed the image to Docker Hub (it was on quay.io before). I'm not sure what's the best way to test for this -- I checked the produced wheels locally to verify that liblz4.so is no longer required

Author: Wes McKinney <wesm+git@apache.org>

Closes apache#4828 from wesm/ARROW-5868 and squashes the following commits:

1fb1acb <Wes McKinney> Remove liblz4 shared libraries from /usr/local so static linking occurs
https://issues.apache.org/jira/browse/ARROW-5873

When Cython extension types are used as function arguments (and so can receive a value passed by the user), they are allowed to be None, so we need to explicitly check that they are not None before accessing their attributes (https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#extension-types-and-none)
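
In pure Python terms the guard amounts to the following sketch. The `Schema` class here is a hypothetical stand-in, not the pyarrow class, and returning `False` for `None` is an assumption made for illustration; raising `TypeError` would be an equally valid design.

```python
class Schema:
    """Hypothetical stand-in for an extension type with an equals() method."""

    def __init__(self, names):
        self.names = list(names)

    def equals(self, other):
        # Guard first: callers may pass None where a Schema is expected.
        if other is None:
            return False
        return self.names == other.names
```

In pure Python a missing guard only produces an `AttributeError`; in Cython the attribute access on `None` is unchecked, which is how the segfault arises.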

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes apache#4839 from jorisvandenbossche/ARROW-5873-schema-equals-segfault and squashes the following commits:

c4dbee5 <Joris Van den Bossche> ARROW-5873:  guard for passed None in Schema.equals
…n pa.array

https://issues.apache.org/jira/browse/ARROW-5790

In the `NumPyConverter` constructor, we access the strides of the array with `PyArray_STRIDES(arr_)[0]`, so we need to ensure that the array is 1D before passing it there.

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes apache#4837 from jorisvandenbossche/ARROW-5790 and squashes the following commits:

cc66365 <Joris Van den Bossche> ARROW-5790:  raise error when trying to convert 0-dim array in pa.array
- Java servers before didn't actually wait for the Handshake RPC to complete
- Java servers didn't interrupt auth handlers if the client sent an error
- Python/C++ clients didn't explicitly finish their end of the connection

Together, this led to the 'hanging forever' issue @rymurr saw.

I've left some TODOs as I would like to raise Flight-specific exceptions (which I'm working on in parallel).

Travis: https://travis-ci.com/lihalite/arrow/builds/118503572
AppVeyor: https://ci.appveyor.com/project/lihalite/arrow/builds/25858510

Author: David Li <li.davidm96@gmail.com>

Closes apache#4838 from lihalite/arrow-5877 and squashes the following commits:

fc35d19 <David Li> Wait for authentication to complete server-side
Write FieldNodes in correct order in ArrowStreamWriter.
Also, fix a small issue with writing BooleanArrays' NullBitmapBuffer.

@pgovind @chutchinson

Author: Eric Erhardt <eric.erhardt@microsoft.com>

Closes apache#4836 from eerhardt/Fix5887 and squashes the following commits:

f0835de <Eric Erhardt> Write FieldNodes in correct order in ArrowStreamWriter.
pitrou and others added 18 commits July 22, 2019 19:03
Remove some unused functions

Closes apache#4899 from pitrou/ARROW-3032-numpy-headers and squashes the following commits:

e094f3b <Antoine Pitrou> ARROW-3032:  Clean up Numpy-related headers

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
…sicDecimal128::FromDouble

- std::round can round down also.
- also, there is a check further down for overflow (2^127-1).

Author: Pindikura Ravindra <ravindra@dremio.com>

Closes apache#4894 from pravindra/arrow-5964 and squashes the following commits:

8d6f37a <Pindikura Ravindra> fix tests
eaee1ab <Pindikura Ravindra> ARROW-5964: remove overflow check after rounding
… restrict libstdc++ symbols.

I tried a more aggressive restriction that exports only *gandiva::*, but unit tests crashed (see the previous commit in this PR).

Author: Zhuo Peng <1835738+brills@users.noreply.github.com>

Closes apache#4883 from brills/linker-script and squashes the following commits:

67e1347 <Zhuo Peng> cmake format
778a194 <Zhuo Peng> Only restrict symbols from libstdc++
a8e0bac <Zhuo Peng> Added a linker script for Gandiva to limit exported symbols
This changes the PKGBUILD script to use the local checkout to build the C++ library.

Author: Neal Richardson <neal.p.richardson@gmail.com>

Closes apache#4900 from nealrichardson/r-appveyor-local and squashes the following commits:

a87a8cf <Neal Richardson> Revert "Test only mine"
16b7e56 <Neal Richardson> More playing with working directory
ed14a37 <Neal Richardson> Expand path
f0bf364 <Neal Richardson> Try this
eef10c6 <Neal Richardson> Oops
ec8aebf <Neal Richardson> Try without checksum
4fa5f70 <Neal Richardson> Test only mine
214a4fe <Neal Richardson> Try building C++ lib from local git checkout
Related to [ARROW-5968](https://issues.apache.org/jira/browse/ARROW-5968).
Some Preconditions checks are duplicated in JdbcToArrow#sqlToArrow.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4896 from tianchen92/jdbc_adapter and squashes the following commits:

0b507ea <tianchen> remove duplicated check
Author: Micah Kornfield <emkornfield@gmail.com>

Closes apache#4903 from emkornfield/namespace_macro and squashes the following commits:

5d06dba <Micah Kornfield> ARROW-5976:  RETURN_IF_ERROR(ctx) should be namespaced
Author: Prudhvi Porandla <prudhvi.porandla@icloud.com>

Closes apache#4887 from pprudhvi/div-impl and squashes the following commits:

aa02da5 <Prudhvi Porandla> add test for float32; specify namespace of trunc
7ac9090 <Prudhvi Porandla> explicit cast after trunc
d6a3d60 <Prudhvi Porandla> lint
014658d <Prudhvi Porandla> div for float, double types
7fcaca8 <Prudhvi Porandla> div for integer types
…lder in writer.cc

- Other small cleanups/comments based on my grokking of the code.
- Vectorize inner loop of List based LevelBuilder where possible

Closes apache#4827 from emkornfield/parquet_cleanup and squashes the following commits:

2f77313 <Wes McKinney> Export operator<< symbols for DataType, TimeUnit on Windows
e1698e5 <Micah Kornfield> fix compilation issues
152a6ee <Micah Kornfield> change back to signed it seems to be what the API uses
67ccf3e <Micah Kornfield> update references
ef7fade <Micah Kornfield> remove unneeded nesting
6776e61 <emkornfield> Update cpp/src/parquet/arrow/writer.cc
7cbc8e4 <emkornfield> Update cpp/src/parquet/arrow/writer.cc
83ee5ae <emkornfield> Update cpp/src/parquet/arrow/writer.cc
325af52 <Micah Kornfield> undo reserve
4516110 <Micah Kornfield> Small style issues
7795352 <Micah Kornfield> change from array builder to buffer builde

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
… authors

See https://help.github.com/en/articles/creating-a-commit-with-multiple-authors. So when multiple people contribute to a PR it will show up in their contribution count. In the past we have only attributed the person with the most commits, or in the event of a tie, the most recent committer in a PR. We're having more PRs with multiple people involved so I think it's nice to acknowledge everyone.

I also added the Signed-off-by: mark which includes the Apache committer information in the commit message

This was implemented in Apache Spark in https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py so I've adopted the approach here.
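
A minimal sketch of the trailer generation, under stated assumptions: `attribution_trailers` is a hypothetical function, not the merge script's code, and it picks the lead author automatically (most commits, ties going to the most recent author), whereas the commit titles below suggest the real script prompts the committer.

```python
from collections import Counter

def attribution_trailers(commit_authors, committer):
    """Build attribution trailers for a squashed merge commit.

    `commit_authors` lists one "Name <email>" string per commit, oldest first.
    """
    counts = Counter(commit_authors)
    distinct = list(dict.fromkeys(commit_authors))  # keeps first-seen order
    if len(distinct) == 1:
        lines = [f"Authored-by: {distinct[0]}"]
    else:
        # Assumption: default the lead to whoever has the most commits,
        # breaking ties in favour of the most recent author.
        last_seen = {a: i for i, a in enumerate(commit_authors)}
        lead = max(distinct, key=lambda a: (counts[a], last_seen[a]))
        lines = [f"Lead-authored-by: {lead}"]
        lines += [f"Co-authored-by: {a}" for a in distinct if a != lead]
    lines.append(f"Signed-off-by: {committer}")
    return "\n".join(lines)
```

The trailers go at the end of the commit message, which is where GitHub looks for `Co-authored-by:` lines.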

Closes apache#4882 from wesm/ARROW-5716 and squashes the following commits:

816a137 <Wes McKinney> Do not prompt for lead author if there is only one person involved
74fb732 <Wes McKinney> Put authors at end of commit message so GitHub understands them
6d65ceb <Wes McKinney> Fix DEBUG env variable flag logic
0ebc17e <Wes McKinney> Add debugging output
7924dab <Wes McKinney> Add support for lead/co-authors to merge script

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
…itive types

Related to [ARROW-5861](https://issues.apache.org/jira/browse/ARROW-5861).
Initial implementation to support converting Avro records with primitive types to Arrow objects.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4812 from tianchen92/ARROW-5861 and squashes the following commits:

2439478 <tianchen> use UnsupportedOperationException
fa3f39a <tianchen> resolve comments
7c3a730 <tianchen> add consumers and use GenericDatumReader
61d2dac <tianchen> fix style
54479c8 <tianchen> Initial implement to convert Avro record with primitive types
… dictionary encoding

As discussed in apache#4792

Implement a hash table that stores only the hash and index, and add an equality-check function to the ValueVector API.
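
To make the idea concrete, here is a small Python sketch (the patch itself is Java; the class and function names below are hypothetical). The table keeps only hashes and dictionary indices; on lookup, candidates with a matching hash are compared against the dictionary itself, which is the role an equality check on the vector plays.

```python
class DictionaryHashTable:
    """Maps values to dictionary indices while storing only (hash, index)."""

    def __init__(self, dictionary):
        self.dictionary = dictionary        # list standing in for a vector
        self.buckets = {}                   # hash -> list of indices
        for index, value in enumerate(dictionary):
            self.buckets.setdefault(hash(value), []).append(index)

    def index_of(self, value):
        """Return the dictionary index of `value`, or -1 if absent."""
        for index in self.buckets.get(hash(value), []):
            # Equality is checked against the dictionary itself, so the
            # table never needs its own copy of the value.
            if self.dictionary[index] == value:
                return index
        return -1

def encode(values, dictionary):
    """Replace each value by its dictionary index (-1 if not found)."""
    table = DictionaryHashTable(dictionary)
    return [table.index_of(v) for v in values]
```

Values with colliding hashes end up in the same bucket and are told apart by the equality check.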

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype
… duplication

A bunch of business logic had gotten copy-pasted to create parquet/arrow/record_writer.*. This bases ColumnReader/RecordReader off a common private base class and removes other code duplication.

I'm going to base explorations in ARROW-3772 on this

Closes apache#4906 from wesm/PARQUET-1468 and squashes the following commits:

5eb664f <Wes McKinney> Finish cleaning, compiles and tests pass
3df7c93 <Wes McKinney> Consolidate internal::RecordReader, ColumnReader files

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
Author: Marco Neumann <marco@crepererum.net>

Closes apache#4911 from crepererum/ARROW-5990 and squashes the following commits:

b05e734 <Marco Neumann> add bounds check to RowGroupMetaData.column
Related to [ARROW-5986](https://issues.apache.org/jira/browse/ARROW-5986).

In the last few weeks, we did some refactoring of dictionary encoding.
Since the newly designed hash table for DictionaryEncoder and the hashCode & equals API in ValueVector are already checked in, some classes are no longer used, such as DictionaryEncodingHashTable, BaseBinaryVector, and the related benchmarks & UT.

Fortunately, these changes did not make it into version 0.14, which makes it possible to remove them.
I think this should be merged before 0.14.1? @emkornfield

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4909 from tianchen92/ARROW-5986 and squashes the following commits:

bd6d7af <tianchen> ARROW-5986:  Code cleanup for dictionary encoding
…null when the underlying data is null

For variable-width vectors (VarCharVector and VarBinaryVector), when the validity bit is not set, it means the underlying data is null, so the get method should return null.

However, the current implementation throws an IllegalStateException when NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is clear.

Maybe the purpose of this design is to be consistent with fixed-width vectors. However, the scenario is different: fixed-width vectors (e.g. IntVector) throw an IllegalStateException, simply because the primitive types are non-nullable.
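
The proposed behaviour can be sketched with a toy vector in Python (hypothetical class, not the Java implementation): a cleared validity bit yields `None`, which stays distinguishable from a valid but empty value.

```python
class VarBinaryVector:
    """Toy variable-width vector: validity bitmap, offsets and data buffer."""

    def __init__(self, values):
        self.validity = bytearray((len(values) + 7) // 8)
        self.offsets = [0]
        self.data = bytearray()
        for i, value in enumerate(values):
            if value is not None:
                self.validity[i // 8] |= 1 << (i % 8)
                self.data += value
            self.offsets.append(len(self.data))

    def is_null(self, i):
        return not (self.validity[i // 8] >> (i % 8)) & 1

    def get(self, i):
        if self.is_null(i):
            return None      # null slot: distinct from an empty value
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]])
```

Returning an empty array for a null slot would make slots 1 and 2 of `VarBinaryVector([b"ab", None, b""])` indistinguishable through `get`.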

Author: liyafan82 <fan_li_ya@foxmail.com>

Closes apache#4901 from liyafan82/fly_0717_varget and squashes the following commits:

8fe83f7 <liyafan82>  Variable width vectors' get methods should return null when the underlying data is null
All types of variable-width vectors can reuse the same comparator for sorting & searching.
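
The reason one comparator suffices is that every variable-width vector exposes the same shape: an offsets buffer into a byte buffer. A sketch in Python, assuming unsigned lexicographic byte order and ignoring nulls (both are assumptions; the function names are hypothetical):

```python
from functools import cmp_to_key

def compare_slots(data, offsets, i, j):
    """Three-way compare of slots i and j by their raw bytes."""
    a = data[offsets[i]:offsets[i + 1]]
    b = data[offsets[j]:offsets[j + 1]]
    return (a > b) - (a < b)

def sorted_indices(data, offsets):
    """Return slot indices ordered by the byte content of each slot."""
    n = len(offsets) - 1
    return sorted(
        range(n),
        key=cmp_to_key(lambda i, j: compare_slots(data, offsets, i, j)))
```

The comparator never interprets the bytes, so the same function serves both string-like and binary vectors.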

Author: liyafan82 <fan_li_ya@foxmail.com>

Closes apache#4860 from liyafan82/fly_0712_varsort and squashes the following commits:

cbd8c3f <liyafan82>   Provide a utility to create the default comparator
46d2c11 <liyafan82>  Support sort & compare for all variable width vectors
Related to [ARROW-5918](https://issues.apache.org/jira/browse/ARROW-5918).

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4859 from tianchen92/ARROW-5918 and squashes the following commits:

3732a78 <tianchen> update javadoc and add unsafe method in BaseIntVector
1562ae1 <tianchen> fix 2
7ddce12 <tianchen> fix
26d30be <tianchen> fix
e65473c <tianchen> fix comments
fe9c2ec <tianchen> fix
9b8afa6 <tianchen> enable null checking
9f2fc09 <tianchen> fix
5739cc4 <tianchen> ARROW-5918:  Add get to BaseIntVector interface
Closes apache#4907 from efiop/patch-1 and squashes the following commits:

e4a2743 <Ruslan Kuprieiev> ARROW-5989:  accommodate openjdk-8 path search prefix

Authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs (Member) commented Jul 22, 2019

@pitrou it builds locally, the first three commits should not be there though

@pitrou pitrou force-pushed the ARROW-750-large-binary branch from 1ade42b to e88eddf Compare July 22, 2019 19:19
These are like Binary and String respectively, except with 64-bit offsets
so as to allow extremely large individual values.
pitrou (Member, Author) commented Jul 22, 2019

Closing in favour of #4921
