Skip to content

Conversation

@andrioni
Copy link
Contributor

@andrioni andrioni commented May 1, 2018

From the tests it seems to work, although I'm still learning how everything works. The implementation is based on how the other hash kernels work.

wesm and others added 30 commits February 22, 2018 12:09
…subprocess

This enables this test to pass in an in-place build without running `setup.py develop`

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1640 from wesm/ARROW-2131 and squashes the following commits:

460ded1 [Wes McKinney] Use more idiomatic option for getting module parent directory
faa247a [Wes McKinney] Prepend development module path when spawning subprocess in test_serialization.py
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1638 from wesm/ARROW-2180 and squashes the following commits:

a13931c [Wes McKinney] Remove deprecated C++ APIs from 0.8.0 cycle
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1636 from wesm/ARROW-2132 and squashes the following commits:

dbbef5d [Wes McKinney] Add link to Plasma in main README, tweak language
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1641 from wesm/ARROW-2069 and squashes the following commits:

e66d1b6 [Wes McKinney] Add note that Plasma is not supported on Windows
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1639 from wesm/ARROW-2185 and squashes the following commits:

adf038d  Fix regex
75c3b17  Do not eat closing braces
baa7e2c  Strip CI directives from commit messages when preparing squashed commit message
…re brackets

Change-Id: Iad77d14c9116c30249e3e7f13ed40ed82e579832
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1637 from wesm/ARROW-2093 and squashes the following commits:

e3040a9 <Wes McKinney> Do not install PyTorch in Travis CI
Author: Brian Hulette <brian.hulette@ccri.com>

Closes apache#1647 from TheNeuralBit/js-doc-site and squashes the following commits:

e5ca4df [Brian Hulette] Add JS docs to website
Original question:

apache#1510

Improvement story:

https://issues.apache.org/jira/browse/ARROW-2066

Author: rrussell <rrussell@adobe.com>

Closes apache#1544 from rjrussell77/arrow-2066-docs-azure-parquet and squashes the following commits:

0d3972c <rrussell> Add missing byte_stream declaration/assignment
a5addb0 <rrussell> use more common 'df' instead of 'pd' for pandas dataframe variable, remove head() call and instead use comment to indicate generic fill-in code, add comment re: stream closure in finally block
f056888 <rrussell> Clean up white space
1fe9866 <rrussell> Add try/except/finally blocks to ensure closure of the byte stream
36f7378 <rrussell> Replace usage of tempfile buffer with BytesIO stream
654a6f9 <rrussell> Add back original Notes bullets
5d450fc <rrussell> fix
4770de1 <rrussell> fix
4c75824 <rrussell> Try moving the bullet to remove italics
803cbca <rrussell> Use asterisks for list
051b91d <rrussell> Fix formatting
a015deb <rrussell> Fix formatting
1815816 <rrussell> fix formatting
599e04f <rrussell> Fix formatting
34c5a16 <rrussell> Fix formatting
6fd9f70 <rrussell> remove inline edits
83a38c4 <rrussell> Try to fix italics
f130e04 <rrussell> Change wording a bit
718bd94 <rrussell> Fix unintended italics
7bab640 <rrussell> Refine indented bullet and fix title underline
26a53e4 <rrussell> Fix formatting
5fbea89 <rrussell> Add a note about keys and add polish
5365a9c <rrussell> Add helpful notes about Azure properties
6841116 <rrussell> Polish the formatting
eb643e4 <rrussell> ARROW-2066 Add documentation for Arrow/Azure/Parquet solution
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1644 from pitrou/ARROW-2197-doc-undefined-symbol and squashes the following commits:

99db6e0 <Antoine Pitrou> ARROW-2197: Document C++ ABI issue and workaround
…ng shared_ptr to OutputStream

Add constructors to return pointers to the base interface

Author: Panchen Xue <pan.panchen.xue@gmail.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1642 from xuepanchen/ARROW-2184 and squashes the following commits:

43dc483 <Wes McKinney> Update Python bindings to use new FileOutputStream API
d093062 <Panchen Xue> Add test cases
f4e3b6a <Panchen Xue> Add static constructor for FileOutputStream returning pointer to base OutputStream
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1633 from xhochy/ARROW-2191 and squashes the following commits:

5f67465 <Uwe L. Korn> ARROW-2191:  Only use specific version of jemalloc
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1655 from xhochy/ARROW-2204 and squashes the following commits:

1a21f17 [Uwe L. Korn] ARROW-2204: Fix TLS errors in manylinux1 build
… to its indices' nullBitmap

Resolves https://issues.apache.org/jira/browse/ARROW-2214

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1659 from trxcllnt/ARROW-2214 and squashes the following commits:

8c4ba46 <Paul Taylor> add nullBitmap getter to DictionaryData that proxies to its indices' nullBitmap
Saves ~6min build time.

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1662 from xhochy/ARROW-2212 and squashes the following commits:

acf2319 <Uwe L. Korn> ARROW-2212:  Build Protobuf in base manylinux1 docker image
…g toolchain

C++-only libprotobuf is now being provided by conda-forge

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1664 from wesm/ARROW-2094 and squashes the following commits:

cec6c88 <Wes McKinney> Install libprotobuf and set PROTOBUF_HOME when using toolchain
This should resolve https://issues.apache.org/jira/browse/ARROW-2213

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1658 from trxcllnt/js-fix-npm-release and squashes the following commits:

71273fa <Paul Taylor> specify unpkg key in apache-arrow module's package.json
fd4d5be <Paul Taylor> npm publish via lerna exec
@TheNeuralBit I think this is my bad as "indicies" isn't a real word. what do you think, is this PR worth it? should we name this "keys" instead?

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1666 from trxcllnt/js-fix-indicies-typo and squashes the following commits:

711571a <Paul Taylor> fix typo -- rename indicies to indices
See https://issues.apache.org/jira/browse/ARROW-2206

+ https://github.com/deepankarsharma

Author: lmeyerov <lmeyerov@gmail.com>

Closes apache#1652 from lmeyerov/patch-2 and squashes the following commits:

904e172 <lmeyerov> fix(reduce perspective description)
91291a0 <lmeyerov> fix(Perspective description)
b8ef13e <lmeyerov> fix(missing colon)
80a4630 <lmeyerov> Document Perspective project  (main site)
cc938c1 <lmeyerov>  ARROW-2206 - Document Perspective project
… enable ASAN builds, add more dev docs

More to do here, see discussion in https://issues.apache.org/jira/browse/ARROW-1589

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1503 from wesm/ARROW-2023 and squashes the following commits:

e3a590e <Wes McKinney> Do not invoke memcpy unless number of bytes to read greater than 0. Return more informative error message
90787ec <Wes McKinney> Add ASAN instructions, llvm-symbolizer instructions for fuzzers. Remove Apache Kudu cruft from run-test.sh
53d2c92 <Wes McKinney> Add missing lsan-suppressions.txt to enable ASAN to run with unittests
69cb5b7 <Wes McKinney> Start C++ stream reader tests for malformed inputs
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1665 from pitrou/ARROW-1035-streaming-benchmark and squashes the following commits:

32b1956 <Antoine Pitrou> ARROW-1035:  Add streaming dataframe reconstruction benchmark
Output stream that just writes to stderr. Adapted from StdoutStream, but use `cerr` instead of `cout`.

Author: rvernica <rvernica@gmail.com>

Closes apache#1657 from rvernica/patch-1 and squashes the following commits:

535dd25 <rvernica> Fix lint issue - trim whitespace
4aa45b0 <rvernica> ARROW-2203:  StderrStream class
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1663 from pitrou/ARROW-1937-nested-array-init-doc and squashes the following commits:

b85d1b2 <Antoine Pitrou> ARROW-1937:  Document nested array initialization
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1672 from xhochy/ARROW-2210 and squashes the following commits:

6e5febe <Uwe L. Korn> ARROW-2210:  Reset ptr on failed memory allocation
This ensures the `bin` files are uniformly es5/commonjs for all target modules. Now this works:
```sh
npm i @apache-arrow/esnext-esm
npx arrow2csv -f some-file.arrow
```

Resolves https://issues.apache.org/jira/browse/ARROW-2223

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1669 from trxcllnt/js-add-bin-to-umd-targets and squashes the following commits:

9c6b787 <Paul Taylor> compile src/bin as es5-cjs to all output targets
Author: Korn, Uwe <Uwe.Korn@blue-yonder.com>

Closes apache#1676 from xhochy/ARROW-2230 and squashes the following commits:

b46c1a2 <Korn, Uwe> ARROW-2230:  Strip catch-all tag matching from git-describe
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1674 from pitrou/ARROW-2218-infer-python-file-mode and squashes the following commits:

a5d704e <Antoine Pitrou> ARROW-2218:  PythonFile should infer mode when not given
Resolves https://issues.apache.org/jira/browse/ARROW-2226

Author: Paul Taylor <paul.e.taylor@me.com>
Author: Brian Hulette <brian.hulette@ccri.com>

Closes apache#1671 from trxcllnt/js-fix-dictionary-data and squashes the following commits:

ccecf55 <Paul Taylor> Merge pull request apache#5 from TheNeuralBit/dictionary-vector-tests
3fb9a26 <Brian Hulette> Fix bug in DictionaryVector with nullable indices
2888657 <Brian Hulette> Add dictionary vector unit tests
b0a0c08 <Paul Taylor> use indicies.offset in DictionaryData constructor
@TheNeuralBit somewhere along the way we lost this. Resolves https://issues.apache.org/jira/browse/ARROW-2225

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1670 from trxcllnt/js-fix-multiple-buffers and squashes the following commits:

869cd74 <Paul Taylor> make the reader work for tables split across buffers again
pitrou and others added 15 commits April 23, 2018 22:56
This makes GetFileSize thread-safe and also reduces its cost.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#1934 from pitrou/ARROW-2470-getfilesize and squashes the following commits:

5239207 <Antoine Pitrou> ARROW-2470:  Avoid seeking in GetFileSize
When compiled with GPU support, the PlasmaClient ABI would differ, leading to a crash in the Python bindings to Plasma.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#1933 from pitrou/ARROW-2489-plasma-client-abi and squashes the following commits:

7d28354 <Antoine Pitrou> ARROW-2489:  Fix PlasmaClient ABI variation
This PR restores the windows compatibility for the current master.  The memory pool abstraction is not used anywhere else within the code base and so is not included in windows builds.

Next I plan to add [CI](https://issues.apache.org/jira/browse/ARROW-2436?filter=12343557&jql=project%20%3D%20ARROW%20AND%20component%20%3D%20Rust%20AND%20status%20%3D%20Open) for windows.  Followed by [ARROW-2474](https://issues.apache.org/jira/browse/ARROW-2474?filter=12343557&jql=project%20%3D%20ARROW%20AND%20component%20%3D%20Rust%20AND%20status%20%3D%20Open), which I will add windows compatibility to the memory pool abstraction, allowing it to be used throughout the code base.

Author: Paddy <paddyhoran@hotmail.com>

Closes apache#1938 from paddyhoran/ARROW-2502 and squashes the following commits:

3e00f13 <Paddy> Fixes ARROW-2502
Author: Philipp Moritz <pcmoritz@gmail.com>
Author: Antoine Pitrou <pitrou@free.fr>

Closes apache#1944 from pcmoritz/fix-pytest-msg and squashes the following commits:

00b2cd4 <Antoine Pitrou> Use `match` argument as intended by the test
6d6cc68 <Philipp Moritz> fix pytest.raises msg to message
Also refactor the type inference visitor and remove the superfluous separate SeqVisitor; improve inference visitor performance by 30%; and add a struct type inference benchmark.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#1935 from pitrou/ARROW-2074-infer-dict-lists and squashes the following commits:

13ed6c3 <Antoine Pitrou> Fix tests on 2.7
3baa2ea <Antoine Pitrou> ARROW-2074:  Infer lists of dicts as struct arrays
This is a followup to apache#1933 which does reference counting of the PlasmaClient held by PlasmaBuffers to avoid the segfault in ARROW-2448.

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1939 from pcmoritz/autoget-sharedptr and squashes the following commits:

f1e6e8b <Philipp Moritz> fix test
2da395f <Philipp Moritz> update
13b1204 <Philipp Moritz> fixes
b68b15e <Philipp Moritz> fix ObjectStatus
6d560db <Philipp Moritz> remove headers
94cdfd7 <Philipp Moritz> add test
6798ed0 <Philipp Moritz> Give shared_ptr of PlasmaClient::Impl to PlasmaBuffer
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>

Closes apache#1943 from kszucs/ARROW-2286 and squashes the following commits:

bbd496c <Krisztián Szűcs> fix review issues
f848858 <Krisztián Szűcs> implement StructValue.__getitem__
708e78f <Krisztián Szűcs> cpp unittests
f4a9bab <Krisztián Szűcs> GetChildByName and GetChildIndex for StructType
Author: Andy Grove <andy.grove@rms.com>

Closes apache#1936 from agrove-rms/jdk8 and squashes the following commits:

d5dca81 <Andy Grove> remove jdk7 from CI matrix
ef01df4 <Andy Grove> use java 1.8 instead of java 1.7
…tead of JDK 7

Author: Andy Grove <andy.grove@rms.com>

Closes apache#1956 from agrove-rms/restore_java_tests and squashes the following commits:

69309d0 <Andy Grove> re-instate JDK tests in matrix, but with JDK 8 instead of JDK 7
The build itself fails too:

```bash
[info] Compiling 103 Scala sources and 6 Java sources to /apache-arrow/spark/streaming/target/scala-2.11/classes...
[error] Compile failed at Apr 13, 2018 9:50:50 AM [1:05.284s]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 17.349 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 20.587 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 12.047 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  8.086 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 18.759 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 10.423 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 19.453 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 17.220 s]
[INFO] Spark Project Core ................................. SUCCESS [12:40 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 32.734 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:02 min]
[INFO] Spark Project Streaming ............................ FAILURE [01:09 min]
[INFO] Spark Project Catalyst ............................. SKIPPED
[INFO] Spark Project SQL .................................. SKIPPED
[INFO] Spark Project ML Library ........................... SKIPPED
[INFO] Spark Project Tools ................................ SKIPPED
[INFO] Spark Project Hive ................................. SKIPPED
[INFO] Spark Project REPL ................................. SKIPPED
[INFO] Spark Project Assembly ............................. SKIPPED
[INFO] Spark Integration for Kafka 0.10 ................... SKIPPED
[INFO] Kafka 0.10 Source for Structured Streaming ......... SKIPPED
[INFO] Spark Project Examples ............................. SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:29 min
[INFO] Finished at: 2018-04-13T09:50:50Z
[INFO] Final Memory: 59M/741M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-streaming_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: C
ompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-streaming_2.11
```

Should I create a JIRA ticket?

Author: Krisztián Szűcs <szucs.krisztian@gmail.com>

Closes apache#1890 from kszucs/ARROW-2452 and squashes the following commits:

224e9dd <Krisztián Szűcs> forward all arguments to docker-compose
1d34704 <Krisztián Szűcs> fix order of arguments
2641080 <Krisztián Szűcs> don't pass group- and userid to docker-compose
…d dictionaries

This introduces a scalar value class DictionaryValue, which fixes a couple bugs involving dictionaries nested inside of ListArrays or inside of other DictionaryArrays. This also includes a new test, which failed previous to this commit but now passes. See https://issues.apache.org/jira/browse/ARROW-2515.

This is my first time contributing, so feedback would be most welcome.

Author: Brent Kerby <blkerby@gmail.com>

Closes apache#1954 from blkerby/DictionaryValue and squashes the following commits:

1e06963 <Brent Kerby> ARROW-2515:  Add DictionaryValue class, fixing bugs with nested dictionaries
…and dictionary array

Implement missing Python properties for DictionaryType

Author: Marco Neumann <marco.neumann@blue-yonder.com>

Closes apache#1951 from crepererum/ARROW-2513 and squashes the following commits:

788179c <Marco Neumann> implement DictionaryType.dictionary
b703ecf <Marco Neumann> implement DictionaryType.index_type
COUNT_VALUES_CASE(BinaryType);
COUNT_VALUES_CASE(StringType);
COUNT_VALUES_CASE(FixedSizeBinaryType);
COUNT_VALUES_CASE(Decimal128Type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to promote some code reuse with the other unary (single-argument) hash kernels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, although I am not sure how to do it. Maybe moving everything to a macro? I am willing to try if somebody could give me some pointers on what's the best way to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is redundant information since Null count is already in the array, but it might be good to either:

  1. Clarify in documentation that null elements are not included in counts
  2. Have an option to include them
  3. Always include them.
    (1 seems like a reasonable thing for now with a follow-up JIRA)

/// \note API not yet finalized
ARROW_EXPORT
Status CountValues(FunctionContext* context, const Datum& value,
std::shared_ptr<Array>* out_uniques,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems more natural to me to have the output type be a struct (but maybe there was discussion on this previously, I guess the existing API had this)?

@emkornfield
Copy link
Contributor

Hi @andrioni I'm trying to triage some old bugs and do first pass code reviews. Thanks for the contribution, I think with a rebase it looks ok to me. @xhochy it looks like this JIRA is assigned to you, was there any particular reason for that?

@xhochy
Copy link
Member

xhochy commented Jan 24, 2019

@emkornfield I once wanted to implement this but never got to it. :(

@wesm
Copy link
Member

wesm commented Jan 24, 2019

This PR may need quite a bit of work since @pitrou has revamped the hash kernels since this was written

@emkornfield
Copy link
Contributor

@andrioni do you think you would be able to update the pull request for this release? If not would you mind if it take over?

@emkornfield
Copy link
Contributor

@andrioni I'm going to try to pickup were you left off.

@emkornfield
Copy link
Contributor

I picked up this up in #3579

@andrioni
Copy link
Contributor Author

andrioni commented Feb 7, 2019

Sorry for the delay in answering, and thanks for picking it up!

@andrioni andrioni closed this Feb 7, 2019
wesm pushed a commit that referenced this pull request Mar 14, 2019
…ue frequencies

Picked up from: #1970

I mostly reused the unit tests as is and modified the rest based on feedback on that PR and adapting to new code.  Some other changes I mae:
1.  Remove the HashKernel and getter from the public API.  I think we can add these back once we have a better idea on how we are doing stateful kernels (i.e. #3407)
2. Add operator[] to NumericBuilder to modify previously added values (this might not be the right place, and I had a little bit of trouble figuring out how to integrate this into the existing TypedTest so the testing is a little bit weak).  This seemed better then using a vector to collect values.

If this approach looks OK for now.  I'm also going to open up a JIRA to try refactor the unittest for Hash kernels (and maybe the headers) since I think there might be a clearer more granular way of laying these out.

other things to discuss:
1.  Handling null values in Count/Unique (it looks like they are dropped today, should this configurable/turned on).
2.  Hashing edge cases for floating point numbers (just opened a JIRA on this).

Author: Micah Kornfield <emkornfield@gmail.com>
Author: Alessandro Andrioni <alessandroandrioni@gmail.com>

Closes #3579 from emkornfield/count_kernel and squashes the following commits:

9c55f7b <Micah Kornfield> make 64 bit
dd0d8a1 <Micah Kornfield> fix link and warning
72095eb <Micah Kornfield> Templatize whether to use return status
54afb2b <Micah Kornfield> change from std::copy to memcopy
2973fcc <Micah Kornfield> Address code review comments
e7e624b <Micah Kornfield> Add constants, per code review
4770d99 <Micah Kornfield> fix warning
c6f6ad7 <Micah Kornfield> address feedback
d99e52f <Micah Kornfield> add guard to CopyValue in cases where vector is empty
8c26b01 <Micah Kornfield> fix format
b7d5492 <Micah Kornfield> add null test
f964bd6 <Micah Kornfield> Rebase
e8e58a5 <Micah Kornfield> Address output type code review feedback
defb4f1 <Micah Kornfield> remove export from .cc
0152f2f <Micah Kornfield> plumb through status on hash visitors
afeb1ad <Micah Kornfield> add real jira
96858bd <Micah Kornfield> Use macro inversion to reduce boiler plate
0dd0077 <Micah Kornfield> minimal test
57349f7 <Micah Kornfield> unit tests passing
34834f7 <Alessandro Andrioni> First try at implementing a CountValues kernel
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
…ue frequencies

Picked up from: apache/arrow#1970

I mostly reused the unit tests as is and modified the rest based on feedback on that PR and adapting to new code.  Some other changes I mae:
1.  Remove the HashKernel and getter from the public API.  I think we can add these back once we have a better idea on how we are doing stateful kernels (i.e. apache/arrow#3407)
2. Add operator[] to NumericBuilder to modify previously added values (this might not be the right place, and I had a little bit of trouble figuring out how to integrate this into the existing TypedTest so the testing is a little bit weak).  This seemed better then using a vector to collect values.

If this approach looks OK for now.  I'm also going to open up a JIRA to try refactor the unittest for Hash kernels (and maybe the headers) since I think there might be a clearer more granular way of laying these out.

other things to discuss:
1.  Handling null values in Count/Unique (it looks like they are dropped today, should this configurable/turned on).
2.  Hashing edge cases for floating point numbers (just opened a JIRA on this).

Author: Micah Kornfield <emkornfield@gmail.com>
Author: Alessandro Andrioni <alessandroandrioni@gmail.com>

Closes #3579 from emkornfield/count_kernel and squashes the following commits:

9c55f7ba6 <Micah Kornfield> make 64 bit
dd0d8a155 <Micah Kornfield> fix link and warning
72095ebc4 <Micah Kornfield> Templatize whether to use return status
54afb2bac <Micah Kornfield> change from std::copy to memcopy
2973fccbe <Micah Kornfield> Address code review comments
e7e624b1f <Micah Kornfield> Add constants, per code review
4770d9924 <Micah Kornfield> fix warning
c6f6ad72f <Micah Kornfield> address feedback
d99e52fb7 <Micah Kornfield> add guard to CopyValue in cases where vector is empty
8c26b0154 <Micah Kornfield> fix format
b7d54929a <Micah Kornfield> add null test
f964bd6da <Micah Kornfield> Rebase
e8e58a5b9 <Micah Kornfield> Address output type code review feedback
defb4f1a1 <Micah Kornfield> remove export from .cc
0152f2fa5 <Micah Kornfield> plumb through status on hash visitors
afeb1ad04 <Micah Kornfield> add real jira
96858bd52 <Micah Kornfield> Use macro inversion to reduce boiler plate
0dd007718 <Micah Kornfield> minimal test
57349f7ea <Micah Kornfield> unit tests passing
34834f711 <Alessandro Andrioni> First try at implementing a CountValues kernel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.