Skip to content

Conversation

@lidavidm
Copy link
Member

This removes fill_null in favor of coalesce.

A few points:

  • This leaves behind the Python aliases for convenience.
  • This is about 25% slower than fill_null for the (Array, Scalar) case for fixed-width types. As far as I can see, this is mostly because the handwritten fill_null only unboxes the scalar once (and does so cheaply), while the generic implementation here unboxes the scalar on every loop (the compiler doesn't manage to hoist it) and UnboxScalar is relatively expensive (requiring a virtual call). I tried to make UnboxScalar less expensive but couldn't avoid undefined behavior/incorrect results for the DayTimeIntervalType case.
  • This specializes all the two-argument cases for fixed-width types but it might not really be worth it except for the (Array, Scalar) case.

@github-actions
Copy link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried using BitRunReader instead?

(er, perhaps we already had this conversation? :-S)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not helpful in case_when but might be useful here, I can give it a try.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may largely depend on the clustering of nulls/non-nulls, hence the suggestion to benchmark different null probabilites.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it does better, though I could imagine it doing badly at ~50% nulls.

--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
Before:
CoalesceScalarBench64/0        1321249 ns      1321175 ns          508 bytes_per_second=5.9133G/s items_per_second=793.67M/s length=1048.58k null%=1 num_args=2
CoalesceScalarBench64/2        5049228 ns      5049073 ns          133 bytes_per_second=1.54731G/s items_per_second=207.677M/s length=1048.58k null%=25 num_args=2
CoalesceScalarStringBench/0   87407512 ns     87404678 ns            7 bytes_per_second=5.70364G/s items_per_second=11.9968M/s length=1048.58k null%=1 num_args=2
CoalesceScalarStringBench/2   77686321 ns     77685936 ns            8 bytes_per_second=4.87622G/s items_per_second=13.4976M/s length=1048.58k null%=25 num_args=2
After:
CoalesceScalarBench64/0         988642 ns       988606 ns          664 bytes_per_second=7.90255G/s items_per_second=1060.66M/s length=1048.58k null%=1 num_args=2
CoalesceScalarBench64/2        4537487 ns      4537432 ns          154 bytes_per_second=1.72179G/s items_per_second=231.095M/s length=1048.58k null%=25 num_args=2
CoalesceScalarStringBench/0  103582667 ns    103575544 ns            6 bytes_per_second=4.81315G/s items_per_second=10.1238M/s length=1048.58k null%=1 num_args=2
CoalesceScalarStringBench/2   79813292 ns     79807705 ns            8 bytes_per_second=4.74658G/s items_per_second=13.1388M/s length=1048.58k null%=25 num_args=2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for completeness, I tested 50% and 99% nulls too. BitRunReader is still faster overall, maybe a bit slower with 50% nulls with strings.

--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BitBlockCounter:
CoalesceScalarBench64/0        1488163 ns      1488088 ns          457 bytes_per_second=5.25003G/s items_per_second=704.647M/s length=1048.58k null%=1 num_args=2
CoalesceScalarBench64/2        5307719 ns      5307520 ns          133 bytes_per_second=1.47197G/s items_per_second=197.564M/s length=1048.58k null%=25 num_args=2
CoalesceScalarBench64/4        8338070 ns      8337713 ns           83 bytes_per_second=959.496M/s items_per_second=125.763M/s length=1048.58k null%=50 num_args=2
CoalesceScalarBench64/6        2187502 ns      2187409 ns          323 bytes_per_second=3.57158G/s items_per_second=479.369M/s length=1048.58k null%=99 num_args=2
CoalesceScalarStringBench/0  112844773 ns    112838638 ns            5 bytes_per_second=4.41803G/s items_per_second=9.2927M/s length=1048.58k null%=1 num_args=2
CoalesceScalarStringBench/2   94380710 ns     94376399 ns            8 bytes_per_second=4.01386G/s items_per_second=11.1106M/s length=1048.58k null%=25 num_args=2
CoalesceScalarStringBench/4   54766774 ns     54766616 ns           10 bytes_per_second=4.64332G/s items_per_second=19.1463M/s length=1048.58k null%=50 num_args=2
CoalesceScalarStringBench/6    6268787 ns      6268586 ns          108 bytes_per_second=1.41205G/s items_per_second=167.275M/s length=1048.58k null%=99 num_args=2
BitRunReader:
CoalesceScalarBench64/0         994000 ns       994007 ns          717 bytes_per_second=7.8596G/s items_per_second=1054.9M/s length=1048.58k null%=1 num_args=2
CoalesceScalarBench64/2        4624991 ns      4625012 ns          153 bytes_per_second=1.68918G/s items_per_second=226.719M/s length=1048.58k null%=25 num_args=2
CoalesceScalarBench64/4        7016190 ns      7016189 ns          102 bytes_per_second=1.1135G/s items_per_second=149.451M/s length=1048.58k null%=50 num_args=2
CoalesceScalarBench64/6        1013352 ns      1013358 ns          672 bytes_per_second=7.70952G/s items_per_second=1034.75M/s length=1048.58k null%=99 num_args=2
CoalesceScalarStringBench/0  110433694 ns    110433042 ns            5 bytes_per_second=4.51427G/s items_per_second=9.49513M/s length=1048.58k null%=1 num_args=2
CoalesceScalarStringBench/2   79904343 ns     79904642 ns            7 bytes_per_second=4.74082G/s items_per_second=13.1228M/s length=1048.58k null%=25 num_args=2
CoalesceScalarStringBench/4   61812967 ns     61813273 ns           10 bytes_per_second=4.11398G/s items_per_second=16.9636M/s length=1048.58k null%=50 num_args=2
CoalesceScalarStringBench/6    6219633 ns      6219571 ns          105 bytes_per_second=1.42317G/s items_per_second=168.593M/s length=1048.58k null%=99 num_args=2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the numbers!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be moved up before // Array with nulls?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for left_valid to be null?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to also parametrize the null_probability used in the coalesce benchmarks? I think we have some logic to do this (see e.g. RegressionArgs and the Take benchmarks).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for left_valid to be null? If there are no nulls in the left argument, I would expect this to shortcut.

@lidavidm lidavidm force-pushed the arrow-7179 branch 2 times, most recently from f46a228 to 2dc5e0d Compare August 26, 2021 13:07
@pitrou
Copy link
Member

pitrou commented Aug 26, 2021

I've submitted a PR to your branch to improve performance:
lidavidm#4

I think the same approach may be used for other conditional kernels, but that would be a separate JIRA.

lidavidm and others added 14 commits August 26, 2021 13:23
Before:
```
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
CoalesceBench64/0              3366683 ns      3364669 ns          213 bytes_per_second=2.32192G/s items_per_second=311.643M/s length=1048.58k null%=1 num_args=2
CoalesceBench64/1              2851663 ns      2849846 ns          245 bytes_per_second=2.74138G/s items_per_second=367.941M/s length=1048.58k null%=1 num_args=4
CoalesceBench64/2              7499865 ns      7495813 ns           95 bytes_per_second=1067.26M/s items_per_second=139.888M/s length=1048.58k null%=25 num_args=2
CoalesceBench64/3             11773437 ns     11766272 ns           60 bytes_per_second=679.909M/s items_per_second=89.1171M/s length=1048.58k null%=25 num_args=4
CoalesceBench64/4              9636207 ns      9631169 ns           73 bytes_per_second=830.636M/s items_per_second=108.873M/s length=1048.58k null%=50 num_args=2
CoalesceBench64/5             19456855 ns     19445858 ns           36 bytes_per_second=411.399M/s items_per_second=53.9228M/s length=1048.58k null%=50 num_args=4
CoalesceBench64/6              3288217 ns      3286426 ns          214 bytes_per_second=2.3772G/s items_per_second=319.063M/s length=1048.58k null%=99 num_args=2
CoalesceBench64/7              7603232 ns      7599720 ns           92 bytes_per_second=1052.67M/s items_per_second=137.976M/s length=1048.58k null%=99 num_args=4
CoalesceScalarBench64/0         775260 ns       774797 ns          904 bytes_per_second=10.0833G/s items_per_second=1.35336G/s length=1048.58k null%=1 num_args=2
CoalesceScalarBench64/2        3500267 ns      3498388 ns          201 bytes_per_second=2.23317G/s items_per_second=299.731M/s length=1048.58k null%=25 num_args=2
CoalesceScalarBench64/4        4815186 ns      4812821 ns          146 bytes_per_second=1.62327G/s items_per_second=217.871M/s length=1048.58k null%=50 num_args=2
CoalesceScalarBench64/6         446897 ns       446783 ns         1541 bytes_per_second=17.4861G/s items_per_second=2.34695G/s length=1048.58k null%=99 num_args=2
CoalesceScalarStringBench/0   74138532 ns     74089097 ns           10 bytes_per_second=6.72872G/s items_per_second=14.1529M/s length=1048.58k null%=1 num_args=2
CoalesceScalarStringBench/2   58106933 ns     58064020 ns            9 bytes_per_second=6.52407G/s items_per_second=18.059M/s length=1048.58k null%=25 num_args=2
CoalesceScalarStringBench/4   52094990 ns     52064312 ns           10 bytes_per_second=4.88432G/s items_per_second=20.14M/s length=1048.58k null%=50 num_args=2
CoalesceScalarStringBench/6    5136540 ns      5133121 ns          138 bytes_per_second=1.7244G/s items_per_second=204.276M/s length=1048.58k null%=99 num_args=2
```

After:
```
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
CoalesceBench64/0              1047061 ns      1046399 ns          661 bytes_per_second=7.46608G/s items_per_second=1002.08M/s length=1048.58k null%=1 num_args=2
CoalesceBench64/1              1377282 ns      1376405 ns          511 bytes_per_second=5.67602G/s items_per_second=761.822M/s length=1048.58k null%=1 num_args=4
CoalesceBench64/2              2804061 ns      2802178 ns          251 bytes_per_second=2.78801G/s items_per_second=374.2M/s length=1048.58k null%=25 num_args=2
CoalesceBench64/3              5234633 ns      5230898 ns          134 bytes_per_second=1.49353G/s items_per_second=200.458M/s length=1048.58k null%=25 num_args=4
CoalesceBench64/4              3700820 ns      3698116 ns          190 bytes_per_second=2.11256G/s items_per_second=283.543M/s length=1048.58k null%=50 num_args=2
CoalesceBench64/5              7731316 ns      7726379 ns           90 bytes_per_second=1035.41M/s items_per_second=135.714M/s length=1048.58k null%=50 num_args=4
CoalesceBench64/6              1004359 ns      1003745 ns          693 bytes_per_second=7.78335G/s items_per_second=1044.66M/s length=1048.58k null%=99 num_args=2
CoalesceBench64/7              4660379 ns      4658001 ns          151 bytes_per_second=1.67722G/s items_per_second=225.113M/s length=1048.58k null%=99 num_args=4
CoalesceScalarBench64/0         656265 ns       655870 ns         1067 bytes_per_second=11.9117G/s items_per_second=1.59876G/s length=1048.58k null%=1 num_args=2
CoalesceScalarBench64/2        2889294 ns      2887898 ns          242 bytes_per_second=2.70525G/s items_per_second=363.093M/s length=1048.58k null%=25 num_args=2
CoalesceScalarBench64/4        4015990 ns      4014054 ns          175 bytes_per_second=1.94629G/s items_per_second=261.226M/s length=1048.58k null%=50 num_args=2
CoalesceScalarBench64/6         390245 ns       390138 ns         1800 bytes_per_second=20.025G/s items_per_second=2.68771G/s length=1048.58k null%=99 num_args=2
CoalesceScalarStringBench/0   82277097 ns     82223643 ns            9 bytes_per_second=6.06303G/s items_per_second=12.7527M/s length=1048.58k null%=1 num_args=2
CoalesceScalarStringBench/2   70821126 ns     70771323 ns           10 bytes_per_second=5.35265G/s items_per_second=14.8164M/s length=1048.58k null%=25 num_args=2
CoalesceScalarStringBench/4   47119447 ns     47087724 ns           13 bytes_per_second=5.40053G/s items_per_second=22.2686M/s length=1048.58k null%=50 num_args=2
CoalesceScalarStringBench/6    4579486 ns      4576728 ns          150 bytes_per_second=1.93403G/s items_per_second=229.11M/s length=1048.58k null%=99 num_args=2
```
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@dianaclarke
Copy link
Contributor

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Aug 26, 2021

Benchmark runs are scheduled for baseline = 20885df and contender = 43a4450. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.46% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.55% ⬆️0.51%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@pitrou
Copy link
Member

pitrou commented Aug 26, 2021

I think the benchmarks changed from git master, so I'm not sure there will be meaningful changes shown by conbench.

@pitrou
Copy link
Member

pitrou commented Aug 27, 2021

I think the benchmarks changed from git master, so I'm not sure there will be meaningful changes shown by conbench.

Indeed, it seems there are no significant changes in the already existing benchmarks.

@pitrou pitrou closed this in eee6bf1 Aug 27, 2021
@pitrou
Copy link
Member

pitrou commented Aug 27, 2021

I've submitted a PR to your branch to improve performance:
lidavidm#4

I think the same approach may be used for other conditional kernels, but that would be a separate JIRA.

Apparently if_else is already quite optimized, so I'm not sure there's anything to be done. I didn't look at case_when.

@lidavidm
Copy link
Member Author

lidavidm commented Aug 27, 2021

Thanks! case_when could probably use some work, though I'm currently in the middle of digging into it to get dictionary support. select also could use some work (though the task there is harder). I'll file issues/take a look when I get a chance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants