Conversation

@mrkn
Member

@mrkn mrkn commented Jun 25, 2020

To reduce the size of libarrow.so, I'd like to reduce the size of the tensor and sparse tensor code.

TODO:

  • Stop using template parameters in Tensor to SparseTensor converters
  • Stop using template parameters in SparseTensor to Tensor converters

@mrkn mrkn marked this pull request as draft June 25, 2020 06:28
@wesm
Member

wesm commented Jun 25, 2020

This already removes more than 2 MB of code from libarrow.so on Linux: great. I'll keep an eye on this.

@mrkn mrkn force-pushed the ARROW-9156 branch 2 times, most recently from af20396 to 1e54d34 Compare June 29, 2020 05:33
@mrkn mrkn marked this pull request as ready for review June 29, 2020 06:48
@mrkn mrkn requested a review from wesm June 29, 2020 06:49
@mrkn
Member Author

mrkn commented Jun 29, 2020

@wesm Could you please review this?

Member

@wesm wesm left a comment


Thank you for working on this; it is definitely very beneficial both for compilation time and code size. I left some minor comments about reducing logic duplication in a few places by factoring some things out into helper functions, but otherwise this looks good, and the existing test suite gives good evidence that things are working as expected.

I'm sensitive to the possible performance issues that this refactor may introduce (by changing inline functions to C function calls or to functions with switch statements in them), but per my comments within, I think the best approach would be to identify the performance problems (if any), write benchmarks to illustrate them, and then pursue optimizations on a case-by-case basis.

Member


You could also use a `std::function` for this; benchmarks would help show which is better. Either solution is going to be slower than the inlined version, but if it is demonstrated to be a performance problem, then we can revisit the template instantiation on a case-by-case basis to decide whether the tradeoff between code size and performance makes sense.

Additionally, I think it could be beneficial to package the tensor support code in a libarrow_tensor in the future, so that we could also "spin off" the tensor support into a separate pyarrow-tensor Python package; users who do not need tensor support would then not have to carry around compiled code that they may never use.

wesm
wesm previously approved these changes Jun 30, 2020
Member

@wesm wesm left a comment


+1, modulo CI. Thank you for using the GetByteWidth helper function in so many places =)

@mrkn
Member Author

mrkn commented Jun 30, 2020

@wesm Would it be better to do the benchmarking in a separate pull request?

@wesm
Member

wesm commented Jun 30, 2020

@mrkn it's up to you; it's fine with me if you work on performance tuning (or at least measurement) in another PR or in this one.

@mrkn
Member Author

mrkn commented Jun 30, 2020

@wesm OK, I'll continue working on the benchmarking in this pull request. If I need more time for tuning, I'll split it into a separate issue.

@wesm wesm dismissed their stale review June 30, 2020 01:26

Awaiting benchmarking

@mrkn mrkn force-pushed the ARROW-9156 branch 2 times, most recently from ca93bc0 to bcc2ee3 Compare July 1, 2020 08:58
@mrkn
Member Author

mrkn commented Jul 1, 2020

I wrote a benchmark that measures the performance of conversion from Tensor to SparseTensor.
I ran it with --repetitions=10 and got the following results:

  • Converting to SparseCOOTensor is 1.4x-1.8x slower than the original.
  • Converting to SparseCSRMatrix and SparseCSCMatrix is 1.0x-1.2x faster than the original.
  • Converting to SparseCSFTensor is 1.0x-1.3x slower than the original.

I don't think this result is acceptable, especially in the SparseCOOTensor case.
I'm now trying to resolve this performance regression.

The full result is shown below:

```
Format  IndexType  ValueType  Change %     Baseline     Contender
COO     Int32Type  Int8         56.426  2459.397090   3847.138775
COO     Int16Type  Int8         43.062  2619.848028   3748.014603
COO     Int64Type  Int8         40.137  2850.826136   3995.057821
COO     Int8Type   Int8         35.478  2776.716694   3761.850287
COO     Int16Type  Int16        66.812  2394.011928   3993.489611
COO     Int32Type  Int16        59.281  2728.577337   4346.096797
COO     Int64Type  Int16        52.043  2596.404005   3947.663313
COO     Int8Type   Int16        44.510  2768.567992   4000.853968
COO     Int32Type  Float        93.283  2287.343291   4421.039474
COO     Int16Type  Float        71.100  2577.689569   4410.437858
COO     Int64Type  Float        61.035  2821.300996   4543.277655
COO     Int8Type   Float        53.303  2666.801559   4088.291362
COO     Int64Type  Double       88.230  2616.000738   4924.096764
COO     Int16Type  Double       70.188  2698.335495   4592.253023
COO     Int32Type  Double       69.368  2579.753902   4369.267270
COO     Int8Type   Double       62.781  2747.179806   4471.895533
CSR     Int8Type   Int8         -1.660  4626.181791   4549.389657
CSR     Int32Type  Int8         -4.718  4708.044556   4485.938002
CSR     Int64Type  Int8         -4.905  4687.961132   4458.024004
CSR     Int16Type  Int8         -7.164  4851.803479   4504.201161
CSR     Int32Type  Int16        -1.481  4722.676795   4652.711030
CSR     Int64Type  Int16        -5.853  4573.124314   4305.462839
CSR     Int8Type   Int16       -11.890  4631.290389   4080.628990
CSR     Int16Type  Int16       -12.372  4737.320600   4151.197839
CSR     Int64Type  Float        -4.999  4711.575565   4476.032868
CSR     Int16Type  Float        -5.062  4811.457037   4567.909440
CSR     Int32Type  Float        -7.651  4814.813084   4446.432341
CSR     Int8Type   Float       -12.196  5158.868330   4529.676101
CSR     Int32Type  Double        0.384  4631.005506   4648.789489
CSR     Int64Type  Double       -0.273  4816.669765   4803.522209
CSR     Int8Type   Double       -3.453  4646.652774   4486.217718
CSR     Int16Type  Double       -8.963  4952.032110   4508.161865
CSC     Int8Type   Int8         -4.207  4871.793338   4666.848429
CSC     Int64Type  Int8        -10.582  4571.251613   4087.509465
CSC     Int16Type  Int8        -13.189  4865.666295   4223.934008
CSC     Int32Type  Int8        -16.546  4999.577506   4172.360711
CSC     Int32Type  Int16        -0.619  4842.515715   4812.531287
CSC     Int8Type   Int16        -4.128  4689.737235   4496.135413
CSC     Int16Type  Int16        -8.110  4643.413441   4266.836346
CSC     Int64Type  Int16       -10.302  4971.879449   4459.696887
CSC     Int8Type   Float        -0.547  4694.347970   4668.667817
CSC     Int32Type  Float        -0.619  4718.207593   4688.994670
CSC     Int64Type  Float        -3.360  4847.753893   4684.849545
CSC     Int16Type  Float        -7.408  4956.795348   4589.607722
CSC     Int32Type  Double        0.491  4917.172760   4941.321519
CSC     Int64Type  Double       -0.754  5069.589290   5031.376333
CSC     Int16Type  Double       -2.023  4762.425910   4666.071367
CSC     Int8Type   Double       -5.987  4934.016268   4638.636384
CSF     Int16Type  Int8         25.394  7885.986312   9888.571309
CSF     Int64Type  Int8         22.605  9083.444559  11136.716151
CSF     Int32Type  Int8         13.486  8228.377064   9338.031561
CSF     Int8Type   Int8         11.900  7687.347967   8602.168534
CSF     Int64Type  Int16        28.094  8247.255772  10564.280045
CSF     Int8Type   Int16        20.276  8172.411486   9829.454135
CSF     Int32Type  Int16        15.240  8776.213727  10113.751856
CSF     Int16Type  Int16        13.707  8396.322326   9547.223214
CSF     Int16Type  Float        25.989  8340.996459  10508.758442
CSF     Int32Type  Float        20.562  8882.075299  10708.400993
CSF     Int8Type   Float        19.021  8507.160025  10125.299339
CSF     Int64Type  Float         1.769  9657.000323   9827.819671
CSF     Int8Type   Double       20.382  8380.183782  10088.211026
CSF     Int32Type  Double       14.353  9512.403696  10877.711426
CSF     Int16Type  Double       13.737  8536.269931   9708.870939
CSF     Int64Type  Double       12.413  9737.555217  10946.296736
```

@mrkn mrkn force-pushed the ARROW-9156 branch 3 times, most recently from ef7781b to 59df077 Compare July 2, 2020 05:42
@wesm
Member

wesm commented Jul 2, 2020

@mrkn to improve the benchmarks' usefulness I would recommend increasing the size of the data being processed. I ran them locally and the COO benchmarks all run in less than 10 microseconds (some close to 1 microsecond), and at that speed things like destructors show up as non-trivial overhead.

```
+   91.29%     0.50%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] arrow::Int8RowMajorTensorConversionFixture_ConvertToSparseCOOTensorInt32_Benchmark::BenchmarkCase
+   91.24%     0.00%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] _start
+   91.24%     0.00%  arrow-tensor-co  libc-2.27.so                       [.] __libc_start_main
+   91.24%     0.00%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] main
+   91.24%     0.00%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] benchmark::RunSpecifiedBenchmarks
+   79.50%     3.05%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] arrow::SparseTensorImpl<arrow::SparseCOOIndex>::Make
+   74.47%     2.29%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::MakeSparseCOOTensorFromTensor
+   74.47%     0.04%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::MakeSparseTensorFromTensor
+   74.33%    16.65%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::(anonymous namespace)::SparseCOOTensorConverter::Convert
+   25.47%    25.47%  arrow-tensor-co  libc-2.27.so                       [.] __memmove_avx_unaligned_erms
+   12.64%     7.00%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::ComputeRowMajorStrides
+   11.35%     1.87%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] arrow::SparseTensor::~SparseTensor
+    9.15%     0.26%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::IsTensorStridesContiguous
+    7.76%     2.52%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::Tensor::CountNonZero
+    7.25%     0.57%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::AllocateBuffer
+    6.99%     6.99%  arrow-tensor-co  libc-2.27.so                       [.] cfree@GLIBC_2.2.5
+    6.99%     0.00%  arrow-tensor-co  libc-2.27.so                       [.] __GI___libc_free (inlined)
+    6.76%     1.07%  arrow-tensor-co  libstdc++.so.6.0.27                [.] operator new
+    6.03%     0.00%  arrow-tensor-co  libc-2.27.so                       [.] __GI___libc_malloc (inlined)
+    5.73%     1.71%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::PoolBuffer::~PoolBuffer
+    5.42%     1.35%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::SparseCOOIndex::SparseCOOIndex
+    4.99%     0.41%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::SparseCOOIndex::~SparseCOOIndex
```

@wesm
Member

wesm commented Jul 3, 2020

What do you think about pursuing the performance optimization work as a follow up?

@mrkn
Member Author

mrkn commented Jul 3, 2020

@wesm I decided to do the performance optimization in a separate pull request because I may need a few days to get that work done.
I'll file a new ticket for the optimization and clean up this pull request.

@wesm
Member

wesm commented Jul 3, 2020

OK, sounds good. Let me know when this is ready to be merged.

@mrkn mrkn force-pushed the ARROW-9156 branch 2 times, most recently from 732e087 to 5523588 Compare July 4, 2020 01:31
Member

@wesm wesm left a comment


+1, CI is looking OK now. Thanks @mrkn!

@wesm wesm closed this in 16b2a44 Jul 5, 2020
@mrkn mrkn deleted the ARROW-9156 branch July 5, 2020 16:48
mrkn added a commit to mrkn/arrow that referenced this pull request Jul 10, 2020
This change improves the conversion speed in all cases for row-major
and column-major tensors.  For strided tensors, all cases are improved
except the combinations of int16 values with index types of 32 bits or narrower.

The result of the `archery benchmark diff` command is below; the baseline is
commit 8f96d1d (before apache#7539 was merged) and the contender is this commit:

```
                                                                            benchmark       baseline      contender  change %
43       Int16StridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  141564.498765  182313.374077    28.785
10       Int16StridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  140420.265077  153715.618715     9.468
42         Int16StridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  167601.944005  170626.538009     1.805
37       Int16StridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  143722.048451  141928.779114    -1.248
27        Int8StridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  169947.630903  164423.055008    -3.251
24          Int8StridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  170153.324442  163898.373534    -3.676
45        Int8StridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  170883.542468  164131.618700    -3.951
35        Int8StridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  171015.028153  163516.191034    -4.385
9         DoubleStridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  200974.675587  191956.688874    -4.487
18         FloatStridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  192320.819787  182941.130595    -4.877
12      DoubleStridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  175198.892973  166417.452194    -5.012
30       FloatStridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  167174.764713  151431.022906    -9.418
29      DoubleStridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  173925.990981  157142.110096    -9.650
16       FloatStridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  167877.497573  151666.610814    -9.656
26       FloatStridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  169705.312801  151885.952803   -10.500
6       DoubleStridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  177394.661870  156019.301906   -12.050
5       Int16RowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  107592.839089   66069.770737   -38.593
41   Int16ColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  114841.700196   68707.073774   -40.172
47      Int16RowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  107304.436017   63922.898636   -40.428
4       FloatRowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  112315.965200   66577.854744   -40.723
21   Int16ColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  115090.317912   66527.852021   -42.195
17   FloatColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  121583.540341   70025.614174   -42.405
3      DoubleRowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  116946.572632   66411.338694   -43.212
15      FloatRowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  112275.805149   63264.226406   -43.653
13   FloatColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  122085.596559   66569.027159   -45.473
34      Int16RowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  109888.801628   58860.826009   -46.436
20   Int16ColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  117648.480324   62574.709433   -46.812
19      Int8ColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  137444.576787   71969.132261   -47.638
28     DoubleRowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  119527.435615   61405.371141   -48.627
40      FloatRowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  115130.821188   58664.779831   -49.045
39    Int8ColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  137053.503574   69755.112894   -49.104
22         Int8RowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  136645.576795   69303.266896   -49.282
23   FloatColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  124100.575779   61723.051518   -50.264
31  DoubleColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  140278.467902   69584.530347   -50.395
1         Int16RowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  135770.669563   67151.922438   -50.540
44     Int16ColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  142625.928542   70315.759868   -50.699
2     Int8ColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  137443.030096   67752.813535   -50.705
46       Int8RowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  135961.160225   66613.351871   -51.006
11  DoubleColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  138857.793332   67315.714410   -51.522
8         FloatRowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  138992.703542   66847.061004   -51.906
7        Int8RowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  136298.424804   64520.497064   -52.662
36     FloatColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  149706.883716   69805.958679   -53.372
33       DoubleRowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  143460.582904   66870.585026   -53.387
38  DoubleColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  138220.367601   64425.776453   -53.389
14     DoubleRowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  136707.421042   63624.050357   -53.460
25    Int8ColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  137303.219403   62528.740787   -54.459
32       Int8RowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  136551.052565   58743.141699   -56.981
0     DoubleColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  162895.437265   69676.279783   -57.226
```
wesm pushed a commit that referenced this pull request Jul 11, 2020
…onversion

In this pull request, the conversion slowdown introduced in #7539 is eliminated, and in some cases the conversion is now faster than it was before #7539.

Closes #7643 from mrkn/ARROW-9331

Authored-by: Kenta Murata <mrkn@mrkn.jp>
Signed-off-by: Wes McKinney <wesm@apache.org>
