Conversation

@mrkn
Member

@mrkn mrkn commented Jun 25, 2020

To reduce the size of libarrow.so, I'd like to reduce the size of the tensor and sparse tensor code.

TODO:

  • Stop using template parameters in Tensor to SparseTensor converters
  • Stop using template parameters in SparseTensor to Tensor converters

@mrkn mrkn marked this pull request as draft June 25, 2020 06:28
@wesm
Member

wesm commented Jun 25, 2020

This already removes more than 2 MB of code from libarrow.so on Linux: great. I'll keep an eye on this.

@mrkn mrkn force-pushed the ARROW-9156 branch 2 times, most recently from af20396 to 1e54d34 Compare June 29, 2020 05:33
@mrkn mrkn marked this pull request as ready for review June 29, 2020 06:48
@mrkn mrkn requested a review from wesm June 29, 2020 06:49
@mrkn
Member Author

mrkn commented Jun 29, 2020

@wesm Could you please review this?

Member

@wesm wesm left a comment


Thank you for working on this; it is definitely very beneficial both for compilation time and code size. I left some minor comments about reducing logic duplication in a few places by factoring some things out into helper functions, but otherwise this looks good, and the existing test suite gives good evidence that things are working as expected.

I'm sensitive to the possible performance issues that this refactor may introduce (by changing inline functions to C function calls or to functions with switch statements in them), but per my comments within, I think the best approach would be to identify the performance problems (if any), write benchmarks to illustrate them, and then pursue optimizations on a case-by-case basis.

Member


You could also use a `std::function` for this; benchmarks would help show which is better. Either solution is going to be slower than the inlined version, but if it is demonstrated to be a performance problem, then we can revisit the template instantiation on a case-by-case basis to decide whether the tradeoff between code size and performance makes sense.

Additionally, I think it could be beneficial to package the tensor support code in a libarrow_tensor in the future, so that we could also "spin off" the tensor support into a separate pyarrow-tensor Python package; users who do not need tensor support would then not have to carry around compiled code that they may never use.

wesm
wesm previously approved these changes Jun 30, 2020
Member

@wesm wesm left a comment


+1, modulo CI. Thank you for using the GetByteWidth helper function in so many places =)

@mrkn
Member Author

mrkn commented Jun 30, 2020

@wesm Would it be better to do the benchmarking in a separate pull request?

@wesm
Member

wesm commented Jun 30, 2020

@mrkn it's up to you; it's fine with me if you work on performance tuning (or at least measurement) in another PR or in this one.

@mrkn
Member Author

mrkn commented Jun 30, 2020

@wesm OK, I'll continue working on the benchmarking in this pull request. If I need more time for tuning, I'll split it into a separate issue.

@wesm wesm dismissed their stale review June 30, 2020 01:26

Awaiting benchmarking

@mrkn mrkn force-pushed the ARROW-9156 branch 2 times, most recently from ca93bc0 to bcc2ee3 Compare July 1, 2020 08:58
@mrkn
Member Author

mrkn commented Jul 1, 2020

I wrote a benchmark that measures the performance of conversion from Tensor to SparseTensor.
I ran it with --repetitions=10 and got the following results:

  • Converting to SparseCOOTensor is 1.4x-1.8x slower than the original.
  • Converting to SparseCSRMatrix and SparseCSCMatrix is 1.0x-1.2x faster than the original.
  • Converting to SparseCSFTensor is 1.0x-1.3x slower than the original.

I don't think this result is acceptable, especially in the SparseCOOTensor case.
I'm now trying to resolve this performance regression.

The full result is shown below:

```
Format  IndexType  ValueType  Change %     Baseline     Contender
COO     Int32Type  Int8         56.426  2459.397090   3847.138775
COO     Int16Type  Int8         43.062  2619.848028   3748.014603
COO     Int64Type  Int8         40.137  2850.826136   3995.057821
COO     Int8Type   Int8         35.478  2776.716694   3761.850287
COO     Int16Type  Int16        66.812  2394.011928   3993.489611
COO     Int32Type  Int16        59.281  2728.577337   4346.096797
COO     Int64Type  Int16        52.043  2596.404005   3947.663313
COO     Int8Type   Int16        44.510  2768.567992   4000.853968
COO     Int32Type  Float        93.283  2287.343291   4421.039474
COO     Int16Type  Float        71.100  2577.689569   4410.437858
COO     Int64Type  Float        61.035  2821.300996   4543.277655
COO     Int8Type   Float        53.303  2666.801559   4088.291362
COO     Int64Type  Double       88.230  2616.000738   4924.096764
COO     Int16Type  Double       70.188  2698.335495   4592.253023
COO     Int32Type  Double       69.368  2579.753902   4369.267270
COO     Int8Type   Double       62.781  2747.179806   4471.895533
CSR     Int8Type   Int8         -1.660  4626.181791   4549.389657
CSR     Int32Type  Int8         -4.718  4708.044556   4485.938002
CSR     Int64Type  Int8         -4.905  4687.961132   4458.024004
CSR     Int16Type  Int8         -7.164  4851.803479   4504.201161
CSR     Int32Type  Int16        -1.481  4722.676795   4652.711030
CSR     Int64Type  Int16        -5.853  4573.124314   4305.462839
CSR     Int8Type   Int16       -11.890  4631.290389   4080.628990
CSR     Int16Type  Int16       -12.372  4737.320600   4151.197839
CSR     Int64Type  Float        -4.999  4711.575565   4476.032868
CSR     Int16Type  Float        -5.062  4811.457037   4567.909440
CSR     Int32Type  Float        -7.651  4814.813084   4446.432341
CSR     Int8Type   Float       -12.196  5158.868330   4529.676101
CSR     Int32Type  Double        0.384  4631.005506   4648.789489
CSR     Int64Type  Double       -0.273  4816.669765   4803.522209
CSR     Int8Type   Double       -3.453  4646.652774   4486.217718
CSR     Int16Type  Double       -8.963  4952.032110   4508.161865
CSC     Int8Type   Int8         -4.207  4871.793338   4666.848429
CSC     Int64Type  Int8        -10.582  4571.251613   4087.509465
CSC     Int16Type  Int8        -13.189  4865.666295   4223.934008
CSC     Int32Type  Int8        -16.546  4999.577506   4172.360711
CSC     Int32Type  Int16        -0.619  4842.515715   4812.531287
CSC     Int8Type   Int16        -4.128  4689.737235   4496.135413
CSC     Int16Type  Int16        -8.110  4643.413441   4266.836346
CSC     Int64Type  Int16       -10.302  4971.879449   4459.696887
CSC     Int8Type   Float        -0.547  4694.347970   4668.667817
CSC     Int32Type  Float        -0.619  4718.207593   4688.994670
CSC     Int64Type  Float        -3.360  4847.753893   4684.849545
CSC     Int16Type  Float        -7.408  4956.795348   4589.607722
CSC     Int32Type  Double        0.491  4917.172760   4941.321519
CSC     Int64Type  Double       -0.754  5069.589290   5031.376333
CSC     Int16Type  Double       -2.023  4762.425910   4666.071367
CSC     Int8Type   Double       -5.987  4934.016268   4638.636384
CSF     Int16Type  Int8         25.394  7885.986312   9888.571309
CSF     Int64Type  Int8         22.605  9083.444559  11136.716151
CSF     Int32Type  Int8         13.486  8228.377064   9338.031561
CSF     Int8Type   Int8         11.900  7687.347967   8602.168534
CSF     Int64Type  Int16        28.094  8247.255772  10564.280045
CSF     Int8Type   Int16        20.276  8172.411486   9829.454135
CSF     Int32Type  Int16        15.240  8776.213727  10113.751856
CSF     Int16Type  Int16        13.707  8396.322326   9547.223214
CSF     Int16Type  Float        25.989  8340.996459  10508.758442
CSF     Int32Type  Float        20.562  8882.075299  10708.400993
CSF     Int8Type   Float        19.021  8507.160025  10125.299339
CSF     Int64Type  Float         1.769  9657.000323   9827.819671
CSF     Int8Type   Double       20.382  8380.183782  10088.211026
CSF     Int32Type  Double       14.353  9512.403696  10877.711426
CSF     Int16Type  Double       13.737  8536.269931   9708.870939
CSF     Int64Type  Double       12.413  9737.555217  10946.296736
```

@mrkn mrkn force-pushed the ARROW-9156 branch 3 times, most recently from ef7781b to 59df077 Compare July 2, 2020 05:42
@wesm
Member

wesm commented Jul 2, 2020

@mrkn to improve the benchmarks' usefulness I would recommend increasing the size of the data being processed. I ran them locally and the COO benchmarks all run in less than 10 microseconds (some close to 1 microsecond), and at that speed things like destructors show up as non-trivial overhead.

```
+   91.29%     0.50%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] arrow::Int8RowMajorTensorConversionFixture_ConvertToSparseCOOTensorInt32_Benchmark::BenchmarkCase
+   91.24%     0.00%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] _start
+   91.24%     0.00%  arrow-tensor-co  libc-2.27.so                       [.] __libc_start_main
+   91.24%     0.00%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] main
+   91.24%     0.00%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] benchmark::RunSpecifiedBenchmarks
+   79.50%     3.05%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] arrow::SparseTensorImpl<arrow::SparseCOOIndex>::Make
+   74.47%     2.29%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::MakeSparseCOOTensorFromTensor
+   74.47%     0.04%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::MakeSparseTensorFromTensor
+   74.33%    16.65%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::(anonymous namespace)::SparseCOOTensorConverter::Convert
+   25.47%    25.47%  arrow-tensor-co  libc-2.27.so                       [.] __memmove_avx_unaligned_erms
+   12.64%     7.00%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::ComputeRowMajorStrides
+   11.35%     1.87%  arrow-tensor-co  arrow-tensor-conversion-benchmark  [.] arrow::SparseTensor::~SparseTensor
+    9.15%     0.26%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::internal::IsTensorStridesContiguous
+    7.76%     2.52%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::Tensor::CountNonZero
+    7.25%     0.57%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::AllocateBuffer
+    6.99%     6.99%  arrow-tensor-co  libc-2.27.so                       [.] cfree@GLIBC_2.2.5
+    6.99%     0.00%  arrow-tensor-co  libc-2.27.so                       [.] __GI___libc_free (inlined)
+    6.76%     1.07%  arrow-tensor-co  libstdc++.so.6.0.27                [.] operator new
+    6.03%     0.00%  arrow-tensor-co  libc-2.27.so                       [.] __GI___libc_malloc (inlined)
+    5.73%     1.71%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::PoolBuffer::~PoolBuffer
+    5.42%     1.35%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::SparseCOOIndex::SparseCOOIndex
+    4.99%     0.41%  arrow-tensor-co  libarrow.so.100.0.0                [.] arrow::SparseCOOIndex::~SparseCOOIndex
```

@wesm
Member

wesm commented Jul 3, 2020

What do you think about pursuing the performance optimization work as a follow up?

@mrkn
Member Author

mrkn commented Jul 3, 2020

@wesm I decided to do the performance optimization in a separate pull request because I may need a few days to get that work done.
I'll file a new ticket for the optimization and clean up this pull request.

@wesm
Member

wesm commented Jul 3, 2020

OK, sounds good. Let me know when this is ready to be merged.

@mrkn mrkn force-pushed the ARROW-9156 branch 2 times, most recently from 732e087 to 5523588 Compare July 4, 2020 01:31
Member

@wesm wesm left a comment


+1, CI is looking OK now. Thanks @mrkn!

@wesm wesm closed this in 16b2a44 Jul 5, 2020
@mrkn mrkn deleted the ARROW-9156 branch July 5, 2020 16:48
mrkn added a commit to mrkn/arrow that referenced this pull request Jul 10, 2020
This change improves the conversion speed in all cases for row-major
and column-major tensors.  For strided tensors, all cases are improved
except the combinations of int16 values with index types of 32 bits or narrower.

The result of the `archery benchmark diff` command is below; the baseline is
commit 8f96d1d (before apache#7539 was merged) and the contender is this commit:

```
                                                                            benchmark       baseline      contender  change %
43       Int16StridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  141564.498765  182313.374077    28.785
10       Int16StridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  140420.265077  153715.618715     9.468
42         Int16StridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  167601.944005  170626.538009     1.805
37       Int16StridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  143722.048451  141928.779114    -1.248
27        Int8StridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  169947.630903  164423.055008    -3.251
24          Int8StridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  170153.324442  163898.373534    -3.676
45        Int8StridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  170883.542468  164131.618700    -3.951
35        Int8StridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  171015.028153  163516.191034    -4.385
9         DoubleStridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  200974.675587  191956.688874    -4.487
18         FloatStridedTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  192320.819787  182941.130595    -4.877
12      DoubleStridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  175198.892973  166417.452194    -5.012
30       FloatStridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  167174.764713  151431.022906    -9.418
29      DoubleStridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  173925.990981  157142.110096    -9.650
16       FloatStridedTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  167877.497573  151666.610814    -9.656
26       FloatStridedTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  169705.312801  151885.952803   -10.500
6       DoubleStridedTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  177394.661870  156019.301906   -12.050
5       Int16RowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  107592.839089   66069.770737   -38.593
41   Int16ColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  114841.700196   68707.073774   -40.172
47      Int16RowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  107304.436017   63922.898636   -40.428
4       FloatRowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  112315.965200   66577.854744   -40.723
21   Int16ColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  115090.317912   66527.852021   -42.195
17   FloatColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  121583.540341   70025.614174   -42.405
3      DoubleRowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  116946.572632   66411.338694   -43.212
15      FloatRowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  112275.805149   63264.226406   -43.653
13   FloatColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  122085.596559   66569.027159   -45.473
34      Int16RowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  109888.801628   58860.826009   -46.436
20   Int16ColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  117648.480324   62574.709433   -46.812
19      Int8ColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  137444.576787   71969.132261   -47.638
28     DoubleRowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  119527.435615   61405.371141   -48.627
40      FloatRowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  115130.821188   58664.779831   -49.045
39    Int8ColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  137053.503574   69755.112894   -49.104
22         Int8RowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  136645.576795   69303.266896   -49.282
23   FloatColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  124100.575779   61723.051518   -50.264
31  DoubleColumnMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  140278.467902   69584.530347   -50.395
1         Int16RowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  135770.669563   67151.922438   -50.540
44     Int16ColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  142625.928542   70315.759868   -50.699
2     Int8ColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  137443.030096   67752.813535   -50.705
46       Int8RowMajorTensorConversionFixture<Int16Type>/ConvertToSparseCOOTensorInt16  135961.160225   66613.351871   -51.006
11  DoubleColumnMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  138857.793332   67315.714410   -51.522
8         FloatRowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  138992.703542   66847.061004   -51.906
7        Int8RowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  136298.424804   64520.497064   -52.662
36     FloatColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  149706.883716   69805.958679   -53.372
33       DoubleRowMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  143460.582904   66870.585026   -53.387
38  DoubleColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  138220.367601   64425.776453   -53.389
14     DoubleRowMajorTensorConversionFixture<Int32Type>/ConvertToSparseCOOTensorInt32  136707.421042   63624.050357   -53.460
25    Int8ColumnMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  137303.219403   62528.740787   -54.459
32       Int8RowMajorTensorConversionFixture<Int64Type>/ConvertToSparseCOOTensorInt64  136551.052565   58743.141699   -56.981
0     DoubleColumnMajorTensorConversionFixture<Int8Type>/ConvertToSparseCOOTensorInt8  162895.437265   69676.279783   -57.226
```
wesm pushed a commit that referenced this pull request Jul 11, 2020
…onversion

In this pull request, the conversion slowdown introduced in #7539 is eliminated, and in some cases the conversion is now faster than it was before #7539.

Closes #7643 from mrkn/ARROW-9331

Authored-by: Kenta Murata <mrkn@mrkn.jp>
Signed-off-by: Wes McKinney <wesm@apache.org>
