Conversation

@wesm (Member) commented Jun 17, 2020

With clang-8 on Linux, this unit is now 654KB, down from 1257KB.

A few strategies:

  • Use same binary code for Less/Greater and LessEqual/GreaterEqual with arguments flipped
  • Reuse kernels for primitive types represented as integers
  • Reuse kernels for binary/string and largebinary/largestring

On the last point, I had to rewrite some of the codegen_internal.h routines that assumed they would receive the specific logical type as the compile-time parameter (which could cause a conflict when trying to put a BinaryArray box around a StringArray's ArrayData). This also makes kernel execution against string data faster in the places where ArrayIterator is used. I will run benchmarks.

As an aside, these kernel code generators will likely contain lower and lower-level code over time.
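To make the first strategy above concrete, here is a generic sketch of the flipped-argument trick. The helper names are invented for illustration; this is not the PR's actual kernel code.

```cpp
// Generic sketch of the argument-flipping idea (not the PR's kernel code):
// the comparison loop is compiled once for Less, and Greater reuses that
// same compiled body by swapping which operand is treated as "left".
#include <cstddef>
#include <cstdint>

template <typename T>
void LessLoop(const T* left, const T* right, std::size_t n, uint8_t* out) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = left[i] < right[i];
  }
}

// Greater(left, right) == Less(right, left), so no second loop body needs
// to be instantiated; the same trick covers LessEqual/GreaterEqual.
template <typename T>
void GreaterLoop(const T* left, const T* right, std::size_t n, uint8_t* out) {
  LessLoop(right, left, n, out);
}
```

The same reasoning drives the other two bullets: any type whose physical layout matches an existing kernel (integer-backed primitives, or large binary/string with wider offsets) can dispatch to one shared instantiation instead of emitting its own.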

@wesm (Member Author) commented Jun 17, 2020

@ursabot benchmark --benchmark_filter=Greater 18e559b

@ursabot commented Jun 17, 2020

no such option: --benchmark_filter

@wesm (Member Author) commented Jun 17, 2020

@ursabot benchmark --benchmark-filter=Greater 18e559b

@ursabot commented Jun 17, 2020

AMD64 Ubuntu 18.04 C++ Benchmark (#112653) builder has been succeeded.

Revision: 53671af32c338fcca1edc40732c4c5fd1ad7585e

  ====================================  ==================  ==================  ========
  benchmark                             baseline            contender           change
  ====================================  ==================  ==================  ========
  GreaterArrayArrayString/32768/2       53.657m items/sec   94.257m items/sec   75.665%
- GreaterArrayArrayInt64/32768/2        363.591m items/sec  337.022m items/sec  -7.307%
  GreaterArrayScalarInt64/32768/100     598.659m items/sec  701.841m items/sec  17.236%
  GreaterArrayScalarString/32768/0      341.707m items/sec  463.243m items/sec  35.567%
  GreaterArrayArrayString/32768/100     54.363m items/sec   75.156m items/sec   38.249%
- GreaterArrayArrayInt64/32768/1        364.615m items/sec  337.861m items/sec  -7.338%
- GreaterArrayArrayInt64/32768/0        365.779m items/sec  338.697m items/sec  -7.404%
  GreaterArrayArrayString/32768/10000   54.734m items/sec   75.788m items/sec   38.465%
  GreaterArrayScalarString/32768/1      364.271m items/sec  569.761m items/sec  56.411%
  GreaterArrayScalarInt64/32768/2       596.674m items/sec  681.189m items/sec  14.164%
- GreaterArrayArrayInt64/32768/10000    364.433m items/sec  338.039m items/sec  -7.242%
  GreaterArrayArrayString/32768/1       127.275m items/sec  195.144m items/sec  53.325%
  GreaterArrayScalarInt64/32768/10      614.042m items/sec  697.835m items/sec  13.646%
  GreaterArrayScalarInt64/32768/1       613.585m items/sec  692.598m items/sec  12.877%
- GreaterArrayArrayInt64/32768/100      360.385m items/sec  337.343m items/sec  -6.394%
  GreaterArrayScalarString/32768/100    331.392m items/sec  447.519m items/sec  35.042%
  GreaterArrayScalarInt64/32768/0       624.860m items/sec  705.244m items/sec  12.864%
- GreaterArrayArrayInt64/32768/10       363.073m items/sec  337.156m items/sec  -7.138%
  GreaterArrayScalarInt64/32768/10000   608.171m items/sec  678.010m items/sec  11.483%
  GreaterArrayScalarString/32768/10     256.765m items/sec  352.403m items/sec  37.247%
  GreaterArrayScalarString/32768/10000  335.759m items/sec  461.788m items/sec  37.535%
  GreaterArrayArrayString/32768/0       54.130m items/sec   75.873m items/sec   40.168%
  GreaterArrayScalarString/32768/2      130.823m items/sec  194.535m items/sec  48.701%
  GreaterArrayArrayString/32768/10      51.190m items/sec   71.220m items/sec   39.127%
  ====================================  ==================  ==================  ========

@wesm (Member Author) commented Jun 17, 2020

Not sure why some of the Int64 benchmarks are slower (AFAIK nothing changed there, so it may be noise), but the string comparisons got a lot faster.

I had to engage in some slight shenanigans to make the temporal Scalar types have a common base type so that we can use e.g. PrimitiveScalar<Int64Scalar> to safely unbox a TimestampScalar. This involved adding an (IMHO overdue) StorageType attribute to several primitive type objects. Otherwise it's difficult to use an Int64 kernel to process Timestamp, Time64, Duration data, etc. Please take a look, @bkietz or @pitrou.
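For readers following along, a minimal sketch of what the common-base/StorageType idea amounts to. All names here are invented stand-ins, not Arrow's real classes.

```cpp
// Invented stand-in names (not Arrow's real API): the logical type
// advertises its storage type, and the timestamp scalar shares a base class
// with the int64 scalar so generic Int64 kernel code can unbox either one.
#include <cstdint>

struct Int64TypeSketch {
  using c_type = int64_t;
};

struct TimestampTypeSketch {
  using c_type = int64_t;
  using StorageType = Int64TypeSketch;  // "physically stored as int64"
};

struct Int64ScalarSketch {
  int64_t value = 0;
};

// Deriving from the int64 scalar gives the two scalars a common base, so a
// kernel generated for Int64 can safely treat a timestamp scalar as int64.
struct TimestampScalarSketch : Int64ScalarSketch {};

inline int64_t UnboxInt64(const Int64ScalarSketch& scalar) {
  return scalar.value;
}
```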

@wesm (Member Author) commented Jun 17, 2020

There weren't any C++ unit tests for comparisons of primitive types, so I addressed that, and also added comparisons for the Time and Duration types (which, on account of this patch, are basically "free").

@wesm (Member Author) commented Jun 17, 2020

@ursabot benchmark --benchmark-filter=Greater 18e559b

@wesm (Member Author) commented on a review thread:

@pitrou anecdotally, it appears this is meaningfully faster than the prior version, which used GetView.

@pitrou (Member) replied:

It would probably depend on the compiler? Logically, it's quite similar.

@wesm (Member Author) replied:

Well, the proof is in the benchmarks. I can start writing microbenchmarks for these internal utilities to help us avoid having arguments about them.

@pitrou (Member) replied:

I don't think micro-benchmarks are useful here, as actual performance will depend on overall optimization of the calling function by the compiler.

In any case, I don't dispute the numbers. But it also seems the improvements would be mostly compiler-dependent.

@wesm (Member Author) replied:

Right, well, AFAICT this change is responsible for the 40-50% performance improvement. So I'm calling it out because there are other places where we use the GetView method that may therefore be performing suboptimally (even if only with some compilers). If it were only 5%, it would be inconclusive.
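For context, the access pattern under discussion is roughly the following: walking the offsets buffer directly and carrying the previous offset across iterations, instead of calling a per-element accessor each time. This is a generic sketch with invented names, not Arrow's actual ArrayIterator or GetView implementation.

```cpp
// Hedged sketch (invented names) of iterating a binary/string array by
// walking the offsets buffer directly rather than calling a GetView(i)-style
// accessor for every element.
#include <cstddef>
#include <cstdint>
#include <string_view>

struct BinaryDataSketch {
  const int32_t* offsets;  // length + 1 entries
  const uint8_t* data;     // concatenated string bytes
  int64_t length;
};

template <typename Visitor>
void VisitStrings(const BinaryDataSketch& arr, Visitor&& visit) {
  // The current offset is carried across iterations, so each element costs
  // one offsets load instead of two independent lookups per call.
  int32_t pos = arr.offsets[0];
  for (int64_t i = 0; i < arr.length; ++i) {
    const int32_t next = arr.offsets[i + 1];
    visit(std::string_view(reinterpret_cast<const char*>(arr.data) + pos,
                           static_cast<std::size_t>(next - pos)));
    pos = next;
  }
}
```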

@ursabot commented Jun 17, 2020

AMD64 Ubuntu 18.04 C++ Benchmark (#112703) builder has been succeeded.

Revision: 301ffa539e634f2c464ca072cd5c543f1407f1f7

  ====================================  ==================  ==================  ========
  benchmark                             baseline            contender           change
  ====================================  ==================  ==================  ========
  GreaterArrayArrayString/32768/2       53.694m items/sec   93.892m items/sec   74.864%
  GreaterArrayScalarInt64/32768/2       616.025m items/sec  681.971m items/sec  10.705%
  GreaterArrayScalarString/32768/2      132.267m items/sec  216.365m items/sec  63.583%
- GreaterArrayArrayInt64/32768/10000    365.572m items/sec  339.713m items/sec  -7.074%
  GreaterArrayArrayString/32768/100     53.635m items/sec   74.222m items/sec   38.384%
- GreaterArrayArrayInt64/32768/100      365.263m items/sec  339.865m items/sec  -6.953%
  GreaterArrayScalarString/32768/10000  341.705m items/sec  547.337m items/sec  60.178%
- GreaterArrayArrayInt64/32768/1        365.608m items/sec  340.598m items/sec  -6.841%
  GreaterArrayArrayString/32768/0       55.046m items/sec   74.164m items/sec   34.733%
  GreaterArrayScalarInt64/32768/10000   551.861m items/sec  679.554m items/sec  23.139%
  GreaterArrayScalarString/32768/1      360.301m items/sec  724.313m items/sec  101.030%
  GreaterArrayScalarInt64/32768/10      618.148m items/sec  669.015m items/sec  8.229%
- GreaterArrayArrayInt64/32768/2        365.552m items/sec  340.107m items/sec  -6.961%
  GreaterArrayScalarString/32768/100    332.135m items/sec  529.815m items/sec  59.518%
- GreaterArrayArrayInt64/32768/0        367.549m items/sec  341.912m items/sec  -6.975%
- GreaterArrayArrayInt64/32768/10       365.734m items/sec  339.651m items/sec  -7.131%
  GreaterArrayScalarInt64/32768/100     604.015m items/sec  667.332m items/sec  10.483%
  GreaterArrayScalarString/32768/10     258.597m items/sec  411.304m items/sec  59.052%
  GreaterArrayScalarInt64/32768/0       622.436m items/sec  671.837m items/sec  7.937%
  GreaterArrayArrayString/32768/1       130.420m items/sec  196.548m items/sec  50.704%
  GreaterArrayScalarInt64/32768/1       622.602m items/sec  699.200m items/sec  12.303%
  GreaterArrayArrayString/32768/10      51.206m items/sec   70.119m items/sec   36.935%
  GreaterArrayScalarString/32768/0      337.556m items/sec  549.035m items/sec  62.650%
  GreaterArrayArrayString/32768/10000   54.218m items/sec   74.885m items/sec   38.117%
  ====================================  ==================  ==================  ========

@wesm (Member Author) commented Jun 17, 2020

Ah! It's because Greater is now implemented using Less. Let me switch things around so they are based on Greater/GreaterEqual instead.

@wesm (Member Author) commented Jun 17, 2020

@ursabot benchmark --benchmark-filter=Greater 18e559b

@wesm (Member Author) commented Jun 17, 2020

Not to beat a dead horse about ARROW-9155, but the turnaround time for simple benchmarks isn't great.

@dhirschfeld commented:

To help follow along, it would be handy if references to JIRA could be autolinked:
https://help.github.com/en/github/administering-a-repository/configuring-autolinks-to-reference-external-resources

That would save lurkers from having to search for the issue, and posters from having to paste an actual URL.

@wesm (Member Author) commented Jun 17, 2020

Well, we have these bot comments; are they not sufficient? #7461 (comment)

@ursabot commented Jun 17, 2020

AMD64 Ubuntu 18.04 C++ Benchmark (#112729) builder has been succeeded.

Revision: 74caaae25e3bd95d57f3f6d9b835c2610639ab41

  ====================================  ==================  ==================  ========
  benchmark                             baseline            contender           change
  ====================================  ==================  ==================  ========
  GreaterArrayScalarString/32768/2      130.267m items/sec  224.418m items/sec  72.276%
  GreaterArrayScalarString/32768/10     261.021m items/sec  498.491m items/sec  90.977%
- GreaterArrayArrayInt64/32768/2        361.493m items/sec  335.383m items/sec  -7.223%
  GreaterArrayScalarInt64/32768/10      565.966m items/sec  690.931m items/sec  22.080%
  GreaterArrayScalarInt64/32768/100     555.331m items/sec  668.238m items/sec  20.331%
- GreaterArrayArrayInt64/32768/10000    364.755m items/sec  338.493m items/sec  -7.200%
- GreaterArrayArrayInt64/32768/1        362.806m items/sec  336.071m items/sec  -7.369%
  GreaterArrayScalarInt64/32768/1       573.242m items/sec  704.449m items/sec  22.889%
  GreaterArrayArrayString/32768/10000   54.921m items/sec   73.848m items/sec   34.463%
  GreaterArrayArrayString/32768/2       52.950m items/sec   92.643m items/sec   74.964%
  GreaterArrayScalarString/32768/1      361.437m items/sec  653.553m items/sec  80.821%
  GreaterArrayScalarInt64/32768/0       555.044m items/sec  701.556m items/sec  26.396%
  GreaterArrayScalarInt64/32768/10000   575.762m items/sec  682.164m items/sec  18.480%
- GreaterArrayArrayInt64/32768/100      361.216m items/sec  338.015m items/sec  -6.423%
  GreaterArrayScalarString/32768/0      339.144m items/sec  707.998m items/sec  108.760%
  GreaterArrayArrayString/32768/1       127.597m items/sec  193.087m items/sec  51.326%
  GreaterArrayArrayString/32768/10      50.585m items/sec   69.612m items/sec   37.613%
  GreaterArrayArrayString/32768/100     54.563m items/sec   73.054m items/sec   33.889%
  GreaterArrayScalarInt64/32768/2       558.941m items/sec  697.482m items/sec  24.786%
  GreaterArrayScalarString/32768/100    331.214m items/sec  669.210m items/sec  102.048%
- GreaterArrayArrayInt64/32768/10       361.791m items/sec  336.834m items/sec  -6.898%
  GreaterArrayArrayString/32768/0       54.614m items/sec   73.456m items/sec   34.499%
  GreaterArrayScalarString/32768/10000  342.276m items/sec  705.190m items/sec  106.030%
- GreaterArrayArrayInt64/32768/0        363.455m items/sec  336.738m items/sec  -7.351%
  ====================================  ==================  ==================  ========

@wesm (Member Author) commented Jun 17, 2020

Well, my theory about Greater/Less didn't hold. The other relevant change was moving things into an anonymous namespace. It's possible that anonymous namespaces affect inlining somehow.
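For reference, a generic illustration (not the kernel code itself) of why internal linkage from an anonymous namespace can change what the optimizer does:

```cpp
// A function in an anonymous namespace has internal linkage, so the compiler
// sees every call site in this translation unit and can inline or discard it
// without keeping an exported out-of-line copy.
#include <cstdint>

namespace {

bool CompareGreater(int64_t left, int64_t right) { return left > right; }

}  // namespace

// A kernel defined later in the same .cc file can call CompareGreater, and
// the optimizer is free to fold the call away entirely.
bool AnyGreater(const int64_t* values, int64_t n, int64_t threshold) {
  for (int64_t i = 0; i < n; ++i) {
    if (CompareGreater(values[i], threshold)) return true;
  }
  return false;
}
```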

@dhirschfeld replied:

> Not to beat a dead horse about ARROW-9155

The bot is fine - I guess it links to whatever JIRA is listed in the title. That doesn't help when someone mentions a JIRA issue in a comment. If the autolink were configured for this repo, that reference would have been automatically converted into a link to ARROW-9155.

It's just a very minor thing, but it does help the DX.

@wesm (Member Author) commented Jun 17, 2020

Can you bring it up somewhere else, like on the mailing list? Someone can ask Apache INFRA to set this up.

@wesm (Member Author) commented Jun 17, 2020

This can be reviewed, but I realized I can compress this further by reusing some equals/not_equal kernels, so I will quickly try to do that this morning.

@pitrou (Member) left a review:

I only took a cursory look.

A reviewer (Member) commented:

I'm not sure this use of inheritance is kosher. It implies TimestampScalar is-a Int64Scalar, a relationship that is not present in the parallel hierarchies (Array, DataType, Builder) and is therefore likely to be confusing. Why not use TemporalScalar : PrimitiveScalar<T>?

@wesm (Member Author) replied:

Int64Scalar and TimestampScalar need to have a common base type in order to both be unboxed by templated code that uses Int64Type in the kernel codegen. So PrimitiveScalar<TimestampType> won't work.

@wesm (Member Author) commented Jun 17, 2020

I'm revamping the documentation about these codegen functions, which I'm dubbing "Generator-Dispatchers" (GDs) for short. I'll add "Generate" to their names. Stay tuned.

@wesm (Member Author) commented Jun 18, 2020

+1. I'm going to merge this to help avoid conflicts caused by the things I just renamed. I welcome further comments and will work to address them in follow-ups.

@wesm (Member Author) commented Jun 18, 2020

I gave up on trying to have, e.g., a common "64-bit" kernel for Equals/NotEquals across Int64/UInt64/Timestamp/etc. The sticking point is scalar unboxing. We might need to fashion a common internal base class for primitive scalar types and give it a virtual method that returns the data as const uint8_t*, which can then be cast to whatever primitive type you want.
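A hedged sketch of that follow-up idea, with invented names rather than an actual Arrow interface: a common scalar base exposes its value as raw bytes so a shared 64-bit kernel can reinterpret it as the C type it needs.

```cpp
// Invented names for illustration only: a common primitive-scalar base with
// a virtual accessor that exposes the stored value as raw bytes.
#include <cstdint>
#include <cstring>

struct PrimitiveScalarBaseSketch {
  virtual ~PrimitiveScalarBaseSketch() = default;
  virtual const uint8_t* data() const = 0;  // pointer to the stored value
};

struct Int64ScalarSketch : PrimitiveScalarBaseSketch {
  int64_t value = 0;
  const uint8_t* data() const override {
    return reinterpret_cast<const uint8_t*>(&value);
  }
};

// A generic kernel unboxes any primitive scalar without knowing its logical
// type; memcpy avoids strict-aliasing problems with the raw byte pointer.
template <typename T>
T UnboxAs(const PrimitiveScalarBaseSketch& scalar) {
  T out;
  std::memcpy(&out, scalar.data(), sizeof(T));
  return out;
}
```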

@wesm closed this in bd42052 on Jun 18, 2020
@wesm deleted the ARROW-8969 branch on June 18, 2020 at 01:17
@wesm (Member Author) commented Jun 19, 2020

I just requested JIRA reference autolinking: https://issues.apache.org/jira/browse/INFRA-20450
