
Conversation


Licht-T commented Dec 1, 2017

This closes ARROW-971.

kou and others added 30 commits September 17, 2017 13:45
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1092 from kou/glib-travis-macos and squashes the following commits:

291808b [Kouhei Sutou] [GLib] Use Xcode 8.3 on Travis CI
If you use `@rpath` for install_name (the default), you can use the
DYLD_LIBRARY_PATH environment variable to find libarrow.dylib. But the
DYLD_LIBRARY_PATH environment variable isn't inherited by subprocesses
because of System Integrity Protection (SIP), which makes libarrow.dylib
difficult to use.

You can use a full-path install_name with the -DARROW_INSTALL_NAME_RPATH=OFF
CMake option. With it, libarrow.dylib can be found without the
DYLD_LIBRARY_PATH environment variable.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1100 from kou/cpp-macos-support-install-name and squashes the following commits:

8207ace [Kouhei Sutou] [C++] Support building with full path install_name on macOS
…ld verification script

I found that the script did not work due to remnants from the last time I ran it.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1101 from wesm/ARROW-1542 and squashes the following commits:

0718370 [Wes McKinney] Install packages in temporary directory in MSVC build verification script
Since we're accumulating a bunch of components, I started this script which we can refine to make verifying releases easier for others.

I bootstrapped some pieces off https://github.com/apache/parquet-cpp/blob/master/dev/release/verify-release-candidate, very helpful!

This script:

* Checks GPG signatures and checksums
* Sets up a temporary Python installation for the duration of these tests
* Builds/installs C++ and runs tests (with Python and Plasma)
* Builds parquet-cpp against the Arrow RC
* Python (with Parquet and Plasma extensions)
* C GLib (requires Ruby in PATH and the gems indicated in README)
* Integration tests
* JavaScript (requires NodeJS >= 6.0.0)

There are some potentially snowflake-y aspects to my environment:

* BOOST_ROOT is set to a Boost install location containing libraries built with `-fPIC`. I'm not sure what to do about this one. One possibly better option is to use system-level Boost and shared libraries
* Maven 3.3.9 is in PATH
* NodeJS 6.11.3 is in PATH

There are probably some other things that Linux users will run into as they run this script.

I had to compile the GLib libraries as part of this since the system-level ones (Ubuntu 14.04) are too old.

cc @kou @xhochy

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1102 from wesm/ARROW-559 and squashes the following commits:

8fd6530 [Wes McKinney] Use Boost shared libraries
3531927 [Wes McKinney] Add note to dev/README.md
079b5e4 [Wes McKinney] Fix comments
17f7ac0 [Wes McKinney] More fixes, finally works
adb3146 [Wes McKinney] More work on release verification script
86ef171 [Wes McKinney] Start Linux release verification script
Closes apache#1107

Change-Id: I9cb83279900aed8e04ef8baf049e30c5007e6538
Ubuntu 14.04 ships GLib 2.40.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1106 from kou/glib-support-glib-2.40-again and squashes the following commits:

cbcdf9a [Kouhei Sutou] [GLib] Support GLib 2.40 again
Resolves https://issues.apache.org/jira/browse/ARROW-1544

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1103 from trxcllnt/js-export-vector-typedefs and squashes the following commits:

91a0625 [Paul Taylor] use gulp 4 from github. thought 4-alpha was on npm already.
e5a1034 [Paul Taylor] fix jest test coverage script
c6b09ee [Paul Taylor] export Vector types on root Arrow export
032ad27 [Paul Taylor] add compileOnSave (now required by TS 2.5?)
eb96552 [Paul Taylor] update dependencies
…E.md of c_glib

Add a detailed explanation of common build problems, especially on macOS, which requires some tweaks.

Author: Wataru Shimizu <waruzilla@gmail.com>

Closes apache#1104 from wagavulin/build-troubleshooting and squashes the following commits:

9b65542 [Wataru Shimizu] Improve format and the explanation of installing/linking autoconf archive on macOS.
b6c5274 [Wataru Shimizu] Add "Common build problems" section in the README.md of c_glib
`append_values()` is for appending values in bulk.
`append_nulls()` is for appending nulls in bulk.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1110 from kou/glib-support-bulk-append-in-builder and squashes the following commits:

4926031 [Kouhei Sutou] [GLib] Support bulk append in builder
…iter.close to avoid Windows flakiness

I can reproduce this failure locally, but I'm unsure why it just now started happening. The 0.7.0 release build passed (https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/build/1.0.3357/job/477b1iicmwuy51l8) and there haven't been related code changes since then. Either way, it's better to close the sink explicitly.
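
For reference, a minimal sketch of the explicit-close pattern on the Python side (the file name and data below are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])

writer = pq.ParquetWriter('example.parquet', table.schema)
writer.write_table(table)
# Closing explicitly releases the owned file handle rather than waiting for
# garbage collection, which is what causes the flakiness on Windows.
writer.close()
```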

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1114 from wesm/ARROW-1550 and squashes the following commits:

863827c [Wes McKinney] Check status
7248c79 [Wes McKinney] Explicitly close owned file handles in ParquetWriter.close to avoid flakiness on Windows
I drafted a post to publish tomorrow. If anyone would like to make changes or additions, please post a link to a git commit here for me to cherry-pick.

cc @kou @trxcllnt

@pcmoritz I think we should write a whole blog post about the object serialization functions. The perf wins over pickle when working with large datasets are a pretty big deal.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1111 from wesm/ARROW-1551 and squashes the following commits:

3e05047 [Wes McKinney] Update publication date to 19 September
a9f8770 [Wes McKinney] More edits, links
8c877d9 [Wes McKinney] Draft 0.7.0 release post
Change-Id: I8842358bbdc66635380891982ab3842018615fd9
Change-Id: I9c27893ebfee46364d78963fe20a43f06a1aa700
…ty for computing target memory requirement

cc @jacques-n. This is the same as apache#1097.

That PR was closed because I had to rename the branch and use the correct JIRA number.

Author: siddharth <siddharth@dremio.com>

Closes apache#1112 from siddharthteotia/ARROW-1533 and squashes the following commits:

4c97be4 [siddharth] ARROW-1533: realloc should consider the existing buffer capacity for computing target memory requirement
Problem:

Typically there are three ways of specifying the amount of memory needed for vectors.

CASE (1) allocateNew() – here the application doesn't really specify the memory size or value count. Each vector type has a default value count (4096), and therefore a default size (in bytes) is used in such cases.

For example, for a 4-byte fixed-width vector, we will allocate 32KB of memory for a call to allocateNew().

CASE (2) setInitialCapacity(count) followed by allocateNew() – in this case the application also doesn't specify the value count or size in allocateNew(). However, the call to setInitialCapacity() dictates the amount of memory the subsequent call to allocateNew() will allocate.

For example, we can do setInitialCapacity(1024) and the call to allocateNew() will allocate 4KB of memory for the 4-byte fixed-width vector.

CASE (3) allocateNew(count) – the application is specific about its requirements.

For nullable vectors, the above calls also allocate the memory for the validity vector.

The problem is that BitVector uses a default memory size of 4096 bytes; in other words, we allocate a vector for a 4096*8 value count.

In the default case (as explained above), the vector types have a value count of 4096, so we need only 4096 bits (512 bytes) in the bit vector, not 4096 bytes.

This happens in CASE (1), where the application depends on the default memory allocation. In such cases, the buffer for the bit vector is 8x larger than actually needed.

Author: siddharth <siddharth@dremio.com>

Closes apache#1109 from siddharthteotia/ARROW-1547 and squashes the following commits:

c92164a [siddharth] addressed review comments
f3d1234 [siddharth] ARROW-1547: Fix 8x memory over-allocation in BitVector
Author: Deepak Majeti <deepak.majeti@hpe.com>

Closes apache#1105 from majetideepak/ARROW-1536 and squashes the following commits:

9f4ed61 [Deepak Majeti] Review comments
d49e1aa [Deepak Majeti] Fix failure
055dc30 [Deepak Majeti] ARROW-1536:[C++] Do not transitively depend on libboost_system
…time may need to be installed on Windows

Close apache#819 (tidying)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1115 from wesm/ARROW-1554 and squashes the following commits:

a7c3e27 [Wes McKinney] Update Sphinx install page to note that VC14 runtime may need to be installed separately when using pip on Windows
 Implement setInitialCapacity for MapWriter and pass on this capacity during lazy creation of child vectors

cc @jacques-n , @StevenMPhillips

Author: siddharth <siddharth@dremio.com>

Closes apache#1113 from siddharthteotia/ARROW-1553 and squashes the following commits:

5a759be [siddharth] ARROW-1553:  Implement setInitialCapacity for MapWriter and pass on this capacity during lazy creation of child vectors
We now raise a ValueError when the length of the names doesn't match
the length of the arrays.

```python
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b', 'c'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-cda803f3f774> in <module>()
----> 1 pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b', 'c'])

table.pxi in pyarrow.lib.Table.from_arrays()

table.pxi in pyarrow.lib._schema_from_arrays()

ValueError: Length of names (3) does not match length of arrays (2)
```

This affected `RecordBatch.from_arrays` and `Table.from_arrays`.
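
As a quick illustrative check, the happy path with matching lengths is unchanged for both entry points:

```python
import pyarrow as pa

# Matching numbers of names and arrays still construct normally
batch = pa.RecordBatch.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b'])
table = pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b'])
```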

Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1117 from TomAugspurger/validate-names and squashes the following commits:

4df6f59 [Tom Augspurger] REF: avoid redundant len calculation
965a560 [Wes McKinney] Fix test failure exposed in test_parquet.py
ed74d52 [Tom Augspurger] ARROW-1557 [Python] Validate names length in Table.from_arrays
Author: Li Jin <ice.xelloss@gmail.com>

Closes apache#1067 from icexelloss/json-reader-ARROW-1497 and squashes the following commits:

6d4e1df [Li Jin] Fix JsonReader to read union vectors correctly
…ppedFile::Create

Author: Amir Malekpour <a.malekpour@gmail.com>

Author: Amir Malekpour <a.malekpour@gmail.com>

Closes apache#1116 from amirma/arrow-1500 and squashes the following commits:

689aaa9 [Amir Malekpour]  RROW-1500: [C++] Do not ignore return value from truncate in MemoryMappedFile::Create
…_script stage to fail faster

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1118 from wesm/ARROW-1578 and squashes the following commits:

0bb5202 [Wes McKinney] System python not available on xcode 6.4 machines
d1cf679 [Wes McKinney] Set language: python when linting on macOS
910f684 [Wes McKinney] Fixes for linting. Do not cache .conda_packages
ed9e23a [Wes McKinney] Move linting to separate shell script
b7db083 [Wes McKinney] Only run lint checks when not running in --only-library mode
7e50fad [Wes McKinney] Revert cpplint failure
28fc3fb [Wes McKinney] Typo
329f017 [Wes McKinney] Run lint checks before compiling anything. Make cpplint warning
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1121 from xhochy/ARROW-1591 and squashes the following commits:

0b3a11a [Uwe L. Korn] ARROW-1591: C++: Xcode 9 is not correctly detected
This makes the child field of ListVector have the consistent name `ListVector.DATA_VECTOR_NAME`. Previously, an empty ListVector would have a child named `ZeroVector.name`, which is "[DEFAULT]".

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Steven Phillips <steven@dremio.com>

Closes apache#1119 from BryanCutler/java-ListVector-child-name-ARROW-1347 and squashes the following commits:

c240378 [Bryan Cutler] changed to use instanceof and added test
2923a45 [Steven Phillips] ARROW-1347: [JAVA] return consistent child field name for List vectors
…broken builds

One of the dependencies installed in the docs requirements is causing NumPy to get downgraded by the SAT solver, and this is then causing an ABI conflict with the pyarrow build (which was built with a different version of NumPy). This installs everything in one `conda install` call.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1123 from wesm/ARROW-1595 and squashes the following commits:

60b05ad [Wes McKinney] Install conda dependencies all at once, pin NumPy version
This PR fixes the Table generics to infer the types from the call site:

![kapture 2017-09-21 at 4 03 34](https://user-images.githubusercontent.com/178183/30692953-5b8638d6-9e82-11e7-9d66-b87eb50f0e3f.gif)

@wesm this PR also includes the fixes to the prepublish script I mentioned yesterday.

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1120 from trxcllnt/fix-ts-typings and squashes the following commits:

73d8eee [Paul Taylor] make package the default gulp task
1d269fe [Paul Taylor] flow table method generics
dd1e819 [Paul Taylor] more defensively typed reader internal values
ac6a778 [Paul Taylor] add comments explaining ARROW-1363 reader workaround
e37f885 [Paul Taylor] fix gulp and prepublish scripts
58fa201 [Paul Taylor] enforce exact dependency package versions
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1122 from kou/glib-add-uint-array-builder and squashes the following commits:

24bb9a7 [Kouhei Sutou] [GLib] Add missing "unsigned"
fd23f24 [Kouhei Sutou] [GLib] Fix build error on macOS
5b59775 [Kouhei Sutou] [GLib] Add UIntArrayBuilder
Even though a fixed object ID is used in the implementation, the comment says a random object ID is created.

Author: Kentaro Hayashi <hayashi@clear-code.com>

Closes apache#1124 from kenhys/arrow-1598 and squashes the following commits:

dc5934e [Kentaro Hayashi] ARROW-1598: [C++] Fix diverged code comment in plasma tutorial
…ternal::BitmapReader in lieu of macros

@xhochy since this is causing the crash reported in ARROW-1601, we may want to do a patch release 0.7.1 and parquet-cpp 1.3.1.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1126 from wesm/ARROW-1601 and squashes the following commits:

6cec81c [Wes McKinney] Fix RleDecoder logic with BitmapReader
ba58b8a [Wes McKinney] Fix test name
fa47865 [Wes McKinney] Add BitmapReader class to replace the bitset macros
mrandrewandrade and others added 14 commits November 26, 2017 14:02
* updated file path to brew install for repos dir

* added information about bundled wheel build
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1361 from kou/glib-dictionary-data-type and squashes the following commits:

6ccce1f [Kouhei Sutou] [GLib] Add GArrowDictionaryDataType
This closes [ARROW-1758](https://issues.apache.org/jira/browse/ARROW-1758).

Author: Licht-T <licht-t@outlook.jp>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1347 from Licht-T/clean-pickle-option-for-object-serialization and squashes the following commits:

927f154 [Wes McKinney] Use cloudpickle for lambda serialization if available
ba998dd [Licht-T] CLN: Remove pickle=True option for object serialization
…der, Table.to_batches method

This also fixes ARROW-504 by adding a chunksize option when writing tables to a RecordBatch stream in Python.
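
A minimal sketch of the new method, assuming the `chunksize` keyword described above (the keyword name may have changed in later releases):

```python
import pyarrow as pa

table = pa.Table.from_arrays(
    [pa.array(list(range(10))), pa.array([str(i) for i in range(10)])],
    names=['ints', 'strs'])

# Split the table into record batches of at most 4 rows each
batches = table.to_batches(chunksize=4)
print([b.num_rows for b in batches])  # e.g. [4, 4, 2]
```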

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1364 from wesm/ARROW-1178 and squashes the following commits:

a31e258 [Wes McKinney] Add chunksize argument to RecordBatchWriter.write_table
dc6023a [Wes McKinney] Implement Table.to_batches, add tests
Change-Id: I1db065001e7fc196128e8f8c36b3406a89ccbdd5
This makes for a more convenient / less rigid API without the need for as many usages of `reinterpret_cast<const uint8_t*>`. This does not impact downstream projects (e.g. parquet-cpp is unaffected) unless they provide implementations of these virtual interfaces.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1363 from wesm/ARROW-1850 and squashes the following commits:

af5a348 [Wes McKinney] Update glib, arrow-gpu for API changes
5d5cf2d [Wes McKinney] Use void* / const void* for buffers in file APIs
…erialized Python object with minimal allocation

For systems (like Dask) that prefer to handle their own framed buffer transport, this provides a list of memoryview-compatible objects with minimal copying / allocation from the input data structure, which can similarly be zero-copy reconstructed to the original object.

To motivate the use case, consider a dict of ndarrays:

```
import numpy as np

data = {i: np.random.randn(1000, 1000) for i in range(50)}
```

Here, we have:

```
>>> %timeit serialized = pa.serialize(data)
52.7 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

This is about 400MB of data. Some systems may not want to double memory by assembling this into a single large buffer, like with the `to_buffer` method:

```
>>> written = serialized.to_buffer()
>>> written.size
400015456
```

We provide a `to_components` method which contains a dict with a `'data'` field containing a list of `pyarrow.Buffer` objects. This can be converted back to the original Python object using `pyarrow.deserialize_components`:

```
>>> %timeit components = serialized.to_components()
73.8 µs ± 812 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> list(components.keys())
['num_buffers', 'data', 'num_tensors']

>>> len(components['data'])
101

>>> type(components['data'][0])
pyarrow.lib.Buffer
```

and

```
>>> %timeit recons = pa.deserialize_components(components)
93.6 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

The reason there are 101 data components (1 + 2 * 50) is that:

* 1 buffer for the serialized Union stream representing the object
* 2 buffers for each of the tensors: 1 for the metadata and 1 for the tensor body. The body is separate so that this is zero-copy from the input

The next step after this is ARROW-1784, which is to transport a pandas.DataFrame using this mechanism.

cc @pitrou @jcrist @mrocklin

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1362 from wesm/ARROW-1783 and squashes the following commits:

4ec5a89 [Wes McKinney] Add missing decref on error
e8c76d4 [Wes McKinney] Acquire GIL in GetSerializedFromComponents
1d2e0e2 [Wes McKinney] Fix function documentation
fffc7bb [Wes McKinney] Typos, add deserialize_components to API
50d2fee [Wes McKinney] Finish componentwise serialization roundtrip
58174dd [Wes McKinney] More progress, stubs for reconstruction
b1e31a3 [Wes McKinney] Draft GetTensorMessage
337e1d2 [Wes McKinney] Draft SerializedPyObject::GetComponents
598ef33 [Wes McKinney] Tweak
This removes non-nullable vectors that are no longer part of the vector class hierarchy and renames Nullable*Vector classes to remove the Nullable prefix.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1341 from BryanCutler/java-nullable-vector-rename-ARROW-1710 and squashes the following commits:

7d930dc [Bryan Cutler] fixed realloc test
ff2120d [Bryan Cutler] clean up test
374dfcc [Bryan Cutler] properly rename BitVector file
6b7a85e [Bryan Cutler] remove old BitVector.java before rebase
089f7fc [Bryan Cutler] some minor cleanup
4e580d9 [Bryan Cutler] removed legacy BitVector
74f771f [Bryan Cutler] fixed remaining tests
8c5dfef [Bryan Cutler] fix naming in support classes
6e498e5 [Bryan Cutler] removed nullable prefix
dfed444 [Bryan Cutler] removed non-nullable vectors
…zero offset

This uncovered some bugs. I inspected the other kernels that are untested and, while they look fine, at some point we may want to add more extensive unit tests for this.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1369 from wesm/ARROW-1735 and squashes the following commits:

de41d92 [Wes McKinney] Test CastKernel writing into output array with non-zero offset
**Just posting this for discussion.** See the preceding discussion on https://issues.apache.org/jira/browse/ARROW-1854.

I think the ideal way to solve this would actually be to improve our handling of lists, which should be possible given that pickle seems to outperform us by 6x according to the benchmarks in https://issues.apache.org/jira/browse/ARROW-1854.

Note that the implementation in this PR will not handle numpy arrays of user-defined classes because it will not fall back to cloudpickle when needed.

cc @pcmoritz @wesm
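
For illustration, a hedged sketch of the round trip this enables (assuming the default serialization context after this change; the array contents here are arbitrary):

```python
import numpy as np
import pyarrow as pa

# An object-dtype array of plain Python values; with this change it is
# routed through pickle under the hood (per the discussion above).
arr = np.array(['a', 1, 2.5], dtype=object)

buf = pa.serialize(arr).to_buffer()
result = pa.deserialize(buf)
```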

Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes apache#1360 from robertnishihara/numpyobject and squashes the following commits:

c37a0a0 [Wes McKinney] Fix flake
5191503 [Wes McKinney] Fix post rebase
43f2c80 [Wes McKinney] Add SerializationContext.clone method. Add pandas_serialization_context member that uses pickle for NumPy arrays with unsupported tensor types
c944023 [Wes McKinney] Use pickle.HIGHEST_PROTOCOL, convert to Buffer then memoryview for more memory-efficient transport
cf719c3 [Robert Nishihara] Use pickle to serialize numpy arrays of objects.
…ath prefix

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1366 from wesm/ARROW-1684 and squashes the following commits:

e63e42a [Wes McKinney] Support selecting nested Parquet fields by any path prefix
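
A hypothetical illustration of the new selection behavior (the file and column names below are made up): selecting a prefix such as 'metadata' should also pull in nested children such as 'metadata.size'.

```python
import pyarrow.parquet as pq

# 'example.parquet' and its nested 'metadata' column are hypothetical; passing
# the path prefix in `columns` selects every nested field underneath it.
table = pq.read_table('example.parquet', columns=['metadata'])
```
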
…ength strings

I also fixed a bug that this surfaced in the hash table resize (unit test coverage was not adequate).

Now we have

```
$ ./release/compute-benchmark
Run on (8 X 4200.16 MHz CPU s)
2017-11-28 18:33:53
Benchmark                                                           Time           CPU Iterations
-------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                1352 us       1352 us       1038   2.88639GB/s
BM_BuildStringDictionary/min_time:1.000                          3994 us       3994 us        351   75.5809MB/s
BM_UniqueInt64NoNulls/16M/50/min_time:1.000/real_time           35814 us      35816 us         39   3.49023GB/s
BM_UniqueInt64NoNulls/16M/1024/min_time:1.000/real_time        119656 us     119660 us         12   1069.73MB/s
BM_UniqueInt64NoNulls/16M/10k/min_time:1.000/real_time         174924 us     174930 us          8   731.747MB/s
BM_UniqueInt64NoNulls/16M/1024k/min_time:1.000/real_time       448425 us     448440 us          3   285.443MB/s
BM_UniqueInt64WithNulls/16M/50/min_time:1.000/real_time         49511 us      49513 us         29   2.52468GB/s
BM_UniqueInt64WithNulls/16M/1024/min_time:1.000/real_time      134519 us     134523 us         10   951.541MB/s
BM_UniqueInt64WithNulls/16M/10k/min_time:1.000/real_time       191331 us     191336 us          7   668.999MB/s
BM_UniqueInt64WithNulls/16M/1024k/min_time:1.000/real_time     533597 us     533613 us          3   239.882MB/s
BM_UniqueString10bytes/16M/50/min_time:1.000/real_time         150731 us     150736 us          9    1061.5MB/s
BM_UniqueString10bytes/16M/1024/min_time:1.000/real_time       256929 us     256938 us          5   622.739MB/s
BM_UniqueString10bytes/16M/10k/min_time:1.000/real_time        414412 us     414426 us          3    386.09MB/s
BM_UniqueString10bytes/16M/1024k/min_time:1.000/real_time     1744253 us    1744308 us          1   91.7298MB/s
BM_UniqueString100bytes/16M/50/min_time:1.000/real_time        563890 us     563909 us          2   2.77093GB/s
BM_UniqueString100bytes/16M/1024/min_time:1.000/real_time      704695 us     704720 us          2   2.21727GB/s
BM_UniqueString100bytes/16M/10k/min_time:1.000/real_time       995685 us     995721 us          2   1.56927GB/s
BM_UniqueString100bytes/16M/1024k/min_time:1.000/real_time    3584108 us    3584230 us          1   446.415MB/s
```

We can also refactor the hash table implementations without worrying too much about whether we're making things slower.
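
On the Python side, these hash kernels back methods along the following lines (a small sketch; exact binding availability depends on the release):

```python
import pyarrow as pa

arr = pa.array(['a', 'b', 'a', 'c', 'b', 'a'])

# Both calls exercise the hash-kernel machinery benchmarked above
print(arr.unique())             # -> ['a', 'b', 'c']
print(arr.dictionary_encode())  # dictionary array with indices into ['a', 'b', 'c']
```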

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1370 from wesm/ARROW-1844 and squashes the following commits:

638f1a1 [Wes McKinney] Decrease resize load factor to 0.5
2885c64 [Wes McKinney] Multiply bytes processed by state.iterations()
f7b3619 [Wes McKinney] Add initial Unique benchmarks for int64, strings
JIRA: https://issues.apache.org/jira/browse/ARROW-1869

This PR fixes a spelling error in the class name for `LowCostIdentityHashMap`.
Follow-up for apache#1150.

Author: Ivan Sadikov <ivan.sadikov@team.telstra.com>

Closes apache#1372 from sadikovi/fix-low-cost-identity-hash-map and squashes the following commits:

e3529f6 [Ivan Sadikov] fix low cost identity hash map name
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1365 from kou/glib-dictionary-array and squashes the following commits:

83bfa13 [Kouhei Sutou] [GLib] Add GArrowDictionaryArray

Licht-T commented Dec 1, 2017

CI on Windows failed because of logging.h.

```
  c:\projects\arrow\cpp\src\arrow\util\logging.h(138): error C2220: warning treated as error - no 'object' file generated [C:\projects\arrow\cpp\build\src\arrow\arrow_static.vcxproj]
"C:\projects\arrow\cpp\build\INSTALL.vcxproj" (default target) (1) ->
"C:\projects\arrow\cpp\build\ALL_BUILD.vcxproj" (default target) (3) ->
"C:\projects\arrow\cpp\build\src\arrow\python\arrow_python_shared.vcxproj" (default target) (17) ->
"C:\projects\arrow\cpp\build\src\arrow\arrow_shared.vcxproj" (default target) (18) ->
  c:\projects\arrow\cpp\src\arrow\util\logging.h(138): error C2220: warning treated as error - no 'object' file generated [C:\projects\arrow\cpp\build\src\arrow\arrow_shared.vcxproj]
    79 Warning(s)
    2 Error(s)
Time Elapsed 00:13:59.08
```

kou and others added 5 commits December 1, 2017 16:14
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1377 from kou/glib-unique and squashes the following commits:

4385e22 [Kouhei Sutou] Add garrow_array_unique()
…lues

The last upgrade of the Jackson JSON library changed behavior to no longer allow reading of "NaN" values by default. This change configures the JSON generator and parser to allow NaN values (unquoted) alongside standard floating-point numbers. A test was added for JSON writing/reading, and the test for the Arrow file and stream was modified.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1375 from BryanCutler/java-JsonReader-all_non_numeric-ARROW-1817 and squashes the following commits:

4c4682a [Bryan Cutler] configure JsonWriter to write NaN not as strings, add test for read and write of float with NaN
1fa24f4 [Bryan Cutler] added conf for JacksonParser to allow NaN tokens
wesm force-pushed the feature-isnull-array branch from 76da1ea to 73a0328 on December 1, 2017 at 21:38

wesm commented Dec 1, 2017

I need to think a bit about this one. The Python API side is OK, but on the C++ side we might want to make this kernel-like.


Licht-T commented Dec 3, 2017

@wesm So you mean this should be implemented so that it's reusable in other components?


wesm commented Dec 4, 2017

Right, this computation might be a unit of work in some more general computational pipeline, so it would be useful for it to be implemented with an API similar to the other array kernel functions.


cpcloud commented Feb 1, 2018

@Licht-T What's the status of this PR? Do you plan to move this to a kernel?


wesm commented Oct 8, 2018

This is marked for the 0.12 release. We should rewrite it as a kernel.


wesm commented Jan 4, 2019

Closing as stale; let's revisit in an upcoming release cycle.
