|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Apache Arrow 0.7.0 Release" |
| 4 | +date: "2017-09-18 00:00:00 -0400" |
| 5 | +author: wesm |
| 6 | +categories: [release] |
| 7 | +--- |
| 8 | +<!-- |
| 9 | +{% comment %} |
| 10 | +Licensed to the Apache Software Foundation (ASF) under one or more |
| 11 | +contributor license agreements. See the NOTICE file distributed with |
| 12 | +this work for additional information regarding copyright ownership. |
| 13 | +The ASF licenses this file to you under the Apache License, Version 2.0 |
| 14 | +(the "License"); you may not use this file except in compliance with |
| 15 | +the License. You may obtain a copy of the License at |
| 16 | +
|
| 17 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 18 | +
|
| 19 | +Unless required by applicable law or agreed to in writing, software |
| 20 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 21 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 22 | +See the License for the specific language governing permissions and |
| 23 | +limitations under the License. |
| 24 | +{% endcomment %} |
| 25 | +--> |
| 26 | + |
| 27 | +The Apache Arrow team is pleased to announce the 0.7.0 release. It includes |
| 28 | +[**133 resolved JIRAs**][1], with many new features and bug fixes to the various |
| 29 | +language implementations. The Arrow memory format has remained stable since the |
| 30 | +0.3.x release. |
| 31 | + |
| 32 | +See the [Install Page][2] to learn how to get the libraries for your |
| 33 | +platform. The [complete changelog][3] is also available. |
| 34 | + |
| 35 | +We include some highlights from the release in this post. |
| 36 | + |
| 37 | +## New PMC Member: Kouhei Sutou |
| 38 | + |
| 39 | +Since the last release we have added [Kou][4] to the Arrow Project Management |
| 40 | +Committee. He is also a PMC member for Apache Subversion, and a major contributor to |
| 41 | +many other open source projects. |
| 42 | + |
| 43 | +As an active member of the Ruby community in Japan, Kou has been developing the |
| 44 | +GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby |
| 45 | +users to benefit from the work that's happening in Apache Arrow. |
| 46 | + |
| 47 | +We are excited to be collaborating with the Ruby community on shared |
| 48 | +infrastructure for in-memory analytics and data science. |
| 49 | + |
| 50 | +## Expanded JavaScript (TypeScript) Implementation |
| 51 | + |
| 52 | +[Paul Taylor][5] from the [Falcor][7] and [ReactiveX][6] projects has worked to |
| 53 | +expand the JavaScript implementation (which is written in TypeScript), using |
| 54 | +the latest in modern JavaScript build and packaging technology. We are looking |
| 55 | +forward to building out the JS implementation and bringing it to feature |
| 56 | +parity with the C++ and Java implementations. |
| 57 | + |
| 58 | +We are looking for more JavaScript developers to join the project and work |
| 59 | +together to make Arrow for JS work well with many kinds of front-end use cases, |
| 60 | +like real-time data visualization. |
| 61 | + |
| 62 | +## Type casting for C++ and Python |
| 63 | + |
| 64 | +As part of longer-term efforts to build an Arrow-native in-memory analytics |
| 65 | +library, we implemented a variety of type conversion functions. These functions |
| 66 | +are essential in ETL tasks when conforming one table schema to another. These |
| 67 | +are similar to the `astype` function in NumPy. |
| 68 | + |
| 69 | +```python |
| 70 | +In [17]: import pyarrow as pa |
| 71 | + |
| 72 | +In [18]: arr = pa.array([True, False, None, True]) |
| 73 | + |
| 74 | +In [19]: arr |
| 75 | +Out[19]: |
| 76 | +<pyarrow.lib.BooleanArray object at 0x7ff6fb069b88> |
| 77 | +[ |
| 78 | + True, |
| 79 | + False, |
| 80 | + NA, |
| 81 | + True |
| 82 | +] |
| 83 | + |
| 84 | +In [20]: arr.cast(pa.int32()) |
| 85 | +Out[20]: |
| 86 | +<pyarrow.lib.Int32Array object at 0x7ff6fb0383b8> |
| 87 | +[ |
| 88 | + 1, |
| 89 | + 0, |
| 90 | + NA, |
| 91 | + 1 |
| 92 | +] |
| 93 | +``` |
| 94 | + |
| 95 | +Over time these will expand to support as many input-and-output type |
| 96 | +combinations as possible, with optimized conversions. |
| 97 | + |
| 98 | +## New Arrow GPU (CUDA) Extension Library for C++ |
| 99 | + |
| 100 | +To help with GPU-related projects using Arrow, like the [GPU Open Analytics |
| 101 | +Initiative][8], we have started a C++ add-on library to simplify Arrow memory |
| 102 | +management on CUDA-enabled graphics cards. We would like to expand this to |
| 103 | +include a library of reusable CUDA kernel functions for GPU analytics on Arrow |
| 104 | +columnar memory. |
| 105 | + |
| 106 | +For example, we could write a record batch from CPU memory to GPU device memory |
| 107 | +like so (some error checking omitted): |
| 108 | + |
| 109 | +```c++ |
| 110 | +#include <arrow/api.h> |
| 111 | +#include <arrow/gpu/cuda_api.h> |
| 112 | + |
| 113 | +using namespace arrow; |
| 114 | + |
| 115 | +gpu::CudaDeviceManager* manager; |
| 116 | +std::shared_ptr<gpu::CudaContext> context; |
| 117 | + |
| 118 | +gpu::CudaDeviceManager::GetInstance(&manager); |
| 119 | +manager->GetContext(kGpuNumber, &context); |
| 120 | + |
| 121 | +std::shared_ptr<RecordBatch> batch = GetCpuData(); |
| 122 | + |
| 123 | +std::shared_ptr<gpu::CudaBuffer> device_serialized; |
| 124 | +gpu::SerializeRecordBatch(*batch, context.get(), &device_serialized); |
| 125 | +``` |
| 126 | +
|
| 127 | +We can then "read" the GPU record batch, but the returned `arrow::RecordBatch` |
| 128 | +internally will contain GPU device pointers that you can use for CUDA kernel |
| 129 | +calls: |
| 130 | +
|
| 131 | +```c++ |
| 132 | +std::shared_ptr<RecordBatch> device_batch; |
| 133 | +gpu::ReadRecordBatch(batch->schema(), device_serialized, |
| 134 | +                     default_memory_pool(), &device_batch); |
| 135 | + |
| 136 | +// Now run some CUDA kernels on device_batch |
| 137 | +``` |
| 138 | +
|
| 139 | +## Decimal Integration Tests |
| 140 | +
|
| 141 | +[Phillip Cloud][9] has been working on decimal support in C++ to enable Parquet |
| 142 | +read/write support in C++ and Python, and also end-to-end testing against the |
| 143 | +Arrow Java libraries. |
| 144 | +
|
| 145 | +In the upcoming releases, we hope to complete the remaining data types that |
| 146 | +need end-to-end testing between Java and C++: |
| 147 | +
|
| 148 | +* Fixed size lists (variable-size lists already implemented) |
| 149 | +* Fixed size binary |
| 150 | +* Unions |
| 151 | +* Maps |
| 152 | +* Time intervals |
| 153 | +
|
| 154 | +## Other Notable Python Changes |
| 155 | +
|
| 156 | +Some highlights of Python development outside of bug fixes and general API |
| 157 | +improvements include: |
| 158 | +
|
| 159 | +* Simplified `put` and `get` for arbitrary Python objects in Plasma |
| 160 | +* Object serialization functions: LINK TO DOCS |
| 161 | +
|
| 162 | +* New `flavor='spark'` option to `pyarrow.parquet.write_table` to enable easy |
| 163 | + writing of Parquet files maximized for Spark compatibility |
| 164 | +
|
| 165 | +* `parquet.write_to_dataset` function with support for partitioning |
| 166 | +* Improved support for Dask filesystems |
| 167 | +* Improved usability for IPC (schema, record batch read/write) |
| 168 | +
|
| 169 | +## The Road Ahead |
| 170 | +
|
| 171 | +Upcoming Arrow releases will continue to expand the project to cover more use |
| 172 | +cases. In addition to completing end-to-end testing for all the major data |
| 173 | +types, some of us will be shifting attention to building Arrow-native in-memory |
| 174 | +analytics libraries. |
| 175 | +
|
| 176 | +We are looking for more JavaScript, R, and other programming language |
| 177 | +developers to join the project and expand the available implementations and |
| 178 | +bindings to more languages. |
| 179 | +
|
| 180 | +[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.7.0 |
| 181 | +[2]: http://arrow.apache.org/install |
| 182 | +[3]: http://arrow.apache.org/release/0.7.0.html |
| 183 | +[4]: https://github.com/kou |
| 184 | +[5]: https://github.com/trxcllnt |
| 185 | +[6]: http://reactivex.io |
| 186 | +[7]: https://github.com/netflix/falcor |
| 187 | +[8]: http://gpuopenanalytics.com/ |
| 188 | +[9]: http://github.com/cpcloud |