Skip to content

Conversation

@emkornfield
Copy link
Contributor

Picked up from: #1970

I mostly reused the unit tests as is and modified the rest based on feedback on that PR and adapting to new code. Some other changes I mae:

  1. Remove the HashKernel and getter from the public API. I think we can add these back once we have a better idea on how we are doing stateful kernels (i.e. ARROW-4124: [C++] Draft Aggregate and Sum kernels #3407)
  2. Add operator[] to NumericBuilder to modify previously added values (this might not be the right place, and I had a little bit of trouble figuring out how to integrate this into the existing TypedTest so the testing is a little bit weak). This seemed better then using a vector to collect values.

If this approach looks OK for now. I'm also going to open up a JIRA to try refactor the unittest for Hash kernels (and maybe the headers) since I think there might be a clearer more granular way of laying these out.

other things to discuss:

  1. Handling null values in Count/Unique (it looks like they are dropped today, should this configurable/turned on).
  2. Hashing edge cases for floating point numbers (just opened a JIRA on this).

@emkornfield emkornfield changed the title Count kernel ARROW-1572: [C++] Implement "value counts" kernels for tabulating value frequencies Feb 7, 2019
@emkornfield emkornfield changed the title ARROW-1572: [C++] Implement "value counts" kernels for tabulating value frequencies ARROW-1572: [WIP][C++] Implement "value counts" kernels for tabulating value frequencies Feb 7, 2019
@andrioni
Copy link
Contributor

andrioni commented Feb 7, 2019

Thanks for picking this up from where I left off, I'll close my original PR.

@emkornfield
Copy link
Contributor Author

@andrioni Thanks for the contribution

@emkornfield emkornfield changed the title ARROW-1572: [WIP][C++] Implement "value counts" kernels for tabulating value frequencies ARROW-1572: [C++] Implement "value counts" kernels for tabulating value frequencies Feb 7, 2019
Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes:

  • May I propose to rename this to Distribution?
  • We should also count Nulls.
  • The current output format is 2 arrays. I think we should formalize this and return a single array of List<Struct<Option<value>, count>> explicitly which should allow returning the count for the null value.

This can be done in another ticket/PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, you should set the datum type if we ever refactor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify? I might be missing it but I don't see an API to set the type explicitly (instead it looks like it gets derived when value that is assigned in the underlying variant)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's handled by the variant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my bad.

@xhochy
Copy link
Member

xhochy commented Feb 8, 2019

Rebased

@emkornfield
Copy link
Contributor Author

@fsaintjacques thanks for the review. My thoughts:

  • May I propose to rename this to Distribution?
    ** I renamed to ValueCounts to match up with corresponding Pandas function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html), Distribution might also make sense. Since naming is hard, I'd like to get a second opinion before changing.
  • We should also count Nulls.
    ** I think this should be configurable, similar to Pandas, (and affects Uniques) as well, since this is a bigger change, I'd like to handle under a different JIRA if that is OK.
  • The current output format is 2 arrays. I think we should formalize this and return a single array of List<Struct<Option<value>, count>> explicitly which should allow returning the count for the null value.
    ** Agreed, I am working through making this happen and will update the PR when I complete it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could potentially return a StructArray directly.

@emkornfield
Copy link
Contributor Author

@fsaintjacques I think I fixed the windows build (will check in the morning). I changed the output to be a Struct as suggested. I will file follow-up JIRAs for the rest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put a DCHECK_GE(start, 0)` just to make sure it's positive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also put a ASSERT_LE in and changed to std::copy which i think fixes another bug.

@emkornfield emkornfield force-pushed the count_kernel branch 2 times, most recently from 0176819 to ae08e40 Compare March 1, 2019 04:11
@emkornfield
Copy link
Contributor Author

Clean build 👍

@fsaintjacques
Copy link
Contributor

I want to give a last review, but I don't think anything is blocking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make Values and Count a public constexpr such that user can programatically retrieve the field by name at compile time, maybe expose the index that way too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Done.

@fsaintjacques
Copy link
Contributor

Ready to merge!

@fsaintjacques
Copy link
Contributor

I think you need to rebase for the failing appveyor failure, but I'd say this is safe to merge. LGTM +1.

@emkornfield emkornfield force-pushed the count_kernel branch 3 times, most recently from 7825282 to 2845381 Compare March 4, 2019 03:59
@emkornfield
Copy link
Contributor Author

One thing that I'll do which I'm not sure will help too much is separate out the memcpy fix into it's own PR so we can isolate performance problems to the kernel

@emkornfield
Copy link
Contributor Author

I think there might also be a way forward with template magic to eliminate this altogether. I'll get an update out soon

@emkornfield emkornfield reopened this Mar 11, 2019
@emkornfield
Copy link
Contributor Author

emkornfield commented Mar 12, 2019

With this change it looks like they are fairly close on my box (rerunning causes changes in different directions in different metrics)

Screen Shot 2019-03-11 at 9 24 19 PM

Raw text:
Benchmark                                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                         -0.0038         -0.0036          3050          3039          3041          3030
BM_BuildStringDictionary/min_time:1.000                                   -0.0362         -0.0310          7278          7014          7211          6987
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time                +0.0020         +0.0006         47799         47894         47697         47724
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time              +0.0125         +0.0172         47223         47813         46874         47678
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time             +0.0112         +0.0110         81461         82375         81277         82175
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time           -0.0059         -0.0065        959365        953714        956518        950281
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time              +0.0295         +0.0304         68487         70507         68319         70399
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time            +0.0095         +0.0092         65062         65680         64903         65499
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time           +0.0364         +0.0366         88717         91943         88515         91752
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time         +0.0091         +0.0091        775562        782628        773094        780125
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time               -0.0001         -0.0007        325153        325115        324190        323967
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time             -0.0109         -0.0039        275206        272213        272525        271460
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time            -0.0373         -0.0374        458229        441117        456930        439855
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time          +0.0183         +0.0182       3150635       3208440       3142222       3199298
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time              -0.0034         -0.0022        465118        463514        463705        462684
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time            -0.0381         -0.0383        644111        619588        642393        617816
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time           +0.0546         +0.0554        957323       1009574        954481       1007365
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time         +0.0170         +0.0186       5132796       5220195       5099248       5193933
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time               +0.0063         +0.0070         60059         60437         59915         60334
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time             -0.0231         -0.0229         74762         73035         74568         72859

@wesm
Copy link
Member

wesm commented Mar 14, 2019

Looking at this again and then merging, thanks for your patience

@wesm
Copy link
Member

wesm commented Mar 14, 2019

I ran on a mobile Xeon 2.8 ghz with the cores pinned at 2.8ghz / no turbo boost and things looking OK here

before

---------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time           CPU Iterations
---------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                       2950 us       2950 us        476   1.32273GB/s
BM_BuildStringDictionary/min_time:1.000                                 5941 us       5940 us        234   50.8258MB/s
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time             46613 us      46614 us         30   2.68166GB/s
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time           44150 us      44142 us         27   2.83129GB/s
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time          65805 us      65805 us         21   1.89955GB/s
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time       505750 us     505756 us          3    253.09MB/s
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time           70311 us      70312 us         20   1.77782GB/s
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time         62900 us      62901 us         22   1.98729GB/s
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time        85752 us      85754 us         17   1.45769GB/s
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time     486725 us     486734 us          3   262.982MB/s
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time           308199 us     308202 us          4   519.145MB/s
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time         262568 us     262572 us          5   609.367MB/s
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time        439183 us     439189 us          3   364.313MB/s
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time     2329791 us    2329828 us          1   68.6757MB/s
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time          693907 us     693920 us          2   2.25174GB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time        821354 us     821368 us          2   1.90235GB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      1094868 us    1094886 us          1   1.42711GB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    3770885 us    3770950 us          1   424.304MB/s
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time            13690 us      13690 us        102   1.14134GB/s
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time          31576 us      31574 us         45   506.714MB/s

after

---------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time           CPU Iterations
---------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                       2969 us       2969 us        472   1.31424GB/s
BM_BuildStringDictionary/min_time:1.000                                 5966 us       5966 us        235   50.6048MB/s
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time             46763 us      46763 us         30   2.67303GB/s
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time           44425 us      44426 us         32   2.81373GB/s
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time          69868 us      69869 us         21    1.7891GB/s
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time       511248 us     511255 us          3   250.368MB/s
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time           70771 us      70770 us         20   1.76626GB/s
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time         63143 us      63143 us         22   1.97964GB/s
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time        83603 us      83604 us         17   1.49516GB/s
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time     486510 us     486505 us          3   263.098MB/s
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time           305595 us     305600 us          5   523.569MB/s
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time         251144 us     251149 us          6   637.084MB/s
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time        428113 us     428118 us          3   373.733MB/s
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time     2338876 us    2338908 us          1   68.4089MB/s
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time          691089 us     691102 us          2   2.26093GB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time        803811 us     803825 us          2   1.94387GB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      1103679 us    1103689 us          1   1.41572GB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    4039441 us    4039457 us          1   396.094MB/s
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time            13894 us      13894 us        101   1.12461GB/s
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time          30398 us      30398 us         46   526.348MB/s

merging. thanks @emkornfield!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants