GH-39680: [Java] enable half float support on Java module#39681
GH-39680: [Java] enable half float support on Java module#39681lidavidm merged 11 commits intoapache:mainfrom
Conversation
|
|
java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/Float16.java
Show resolved
Hide resolved
| * where the value will be read from | ||
| * @return 4 byte float value | ||
| */ | ||
| public short getFloat16(long index) { |
There was a problem hiding this comment.
Given that setFloat16 takes a float, by symmetry this method should probably return a float too.
@danepitkin WDYT?
There was a problem hiding this comment.
Casting to float will involve a conversion to a 32 bit type. I think it would be nice to provide set/get operations on short as the core APIs and set/get operations on float as helper APIs. Thoughts?
There was a problem hiding this comment.
Well, getShort and setShort already seems provided?
There was a problem hiding this comment.
heh, good point.
I agree it makes sense to change this to float.
| // Positive infinity of type half-precision float. | ||
| private static final short POSITIVE_INFINITY = (short) 0x7c00; | ||
| // A Not-a-Number representation of a half-precision float. | ||
| private static final short NaN = (short) 0x7e00; |
There was a problem hiding this comment.
Should this and POSITIVE_INFINTY be public for app users to make use of?
| public class UnionReader extends AbstractFieldReader { | ||
|
|
||
| private BaseReader[] readers = new BaseReader[45]; | ||
| private BaseReader[] readers = new BaseReader[46]; |
There was a problem hiding this comment.
It'd be good if this constant was given a name.
|
@github-actions crossbow submit java-jars |
|
|
In order to test Float16 data types in Arrow Java Dataset module, apache/arrow-testing#99 is required. |
lidavidm
left a comment
There was a problem hiding this comment.
In general I think I'd favor using short as the 'default' float16 type and having float be the 'special'. Else the 'default' case has ambiguity when going from float -> float16.
| /* | ||
| The purpose of this method is to refresh the data in alltypes-java.parquet as required: | ||
| https://github.com/apache/arrow-testing/blob/master/data/parquet/alltypes-java.parquet | ||
| */ | ||
| public static void main(String[] args) throws Exception { | ||
| TestAllTypes test = new TestAllTypes(); | ||
| try (BufferAllocator allocator = new RootAllocator(); | ||
| VectorSchemaRoot root = test.generateAllTypesVector(allocator)) { | ||
| byte[] featherData = test.serializeFile(root); | ||
| try (SeekableByteChannel channel = new ByteArrayReadableSeekableByteChannel(featherData); | ||
| ArrowStreamReader reader = new ArrowStreamReader(channel, allocator)) { | ||
| TMP.create(); | ||
| final File writtenFolder = TMP.newFolder(); | ||
| final String writtenParquet = writtenFolder.toURI().toString(); | ||
| DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET, | ||
| writtenParquet); | ||
| Objects.requireNonNull(writtenFolder.listFiles()); | ||
| Files.move(writtenFolder.listFiles()[0].toPath(), | ||
| Paths.get(writtenFolder.toPath().toString(), "alltypes-java.parquet")); | ||
| System.out.println("The file data/parquet/alltypes-java.parquet should be updated with this new data: " + | ||
| writtenFolder.listFiles()[0].toURI()); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Can you refactor this so that it doesn't duplicate the test itself? Also, technically this isn't really necessary; you can run the test in an IDE (and set a breakpoint so you can copy the temporary file)
There was a problem hiding this comment.
Sure, I will clean and delete this section.
java/dataset/pom.xml
Outdated
| <configuration> | ||
| <enableAssertions>false</enableAssertions> | ||
| <systemPropertyVariables> | ||
| <arrow.test.dataRoot>${project.basedir}/../../testing/data</arrow.test.dataRoot> |
There was a problem hiding this comment.
Where is this actually used? I don't see any new tests referencing it.
There was a problem hiding this comment.
The redundant part has already been added before, let me remove it
java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/Float16.java
Show resolved
Hide resolved
| * where the value will be written | ||
| * @param value value to write | ||
| */ | ||
| public void setFloat16(long index, float value) { |
There was a problem hiding this comment.
IMO, these methods should work via shorts and not floats, since there is no conversion ambiguity with shorts.
| } | ||
| } | ||
|
|
||
| @Test /* Float2Vector */ |
java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/Float16.java
Show resolved
Hide resolved
| ) { | ||
| buf.setFloat16(0, +32.875f); | ||
| assertEquals((short) 0x501c, Float16.toFloat16(+32.875f)); | ||
| assertEquals(Float16.toFloat((short) 0x501c), buf.getFloat16(0), 0); |
There was a problem hiding this comment.
Yeah, we should just use short in the first place.
…tAllTypes (#99) To be able to test Float16 data types into Arrow Java Dataset module. To fix dataset errors at: apache/arrow#39681
|
I've merged the testing PR. |
|
|
||
| /** | ||
| * Lifted from Apache Parquet MR project: | ||
| * https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/Float16.java |
There was a problem hiding this comment.
Er, use the URL that contains the commit hash, not 'main'
|
Just for the record, I will work on these items in another PR:
If there are any other items that need to be reviewed, please let me know. |
|
Do we have IPC integration tests for float16? I don't obviously see any |
If not, then perhaps it's worth opening a GH issue? |
|
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2a87693. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
@davisusanibar can you follow up here? |
…he#39681) ### Rationale for this change - To enable half float support on Java module. ### What changes are included in this PR? - [x] Add initial Float16 type support - [x] Unit test - [x] Integration test - [x] Documentation ### Are these changes tested? Yes. ### Are there any user-facing changes? No * Closes: apache#39680 Authored-by: david dali susanibar arce <davi.sarces@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
|
@github-actions crossbow submit -g java |
|
Revision: 19869ad Submitted crossbow builds: ursacomputing/crossbow @ actions-f0517a91b4 |
|
@github-actions crossbow submit java-jars |
|
Revision: 19869ad Submitted crossbow builds: ursacomputing/crossbow @ actions-62925ab8ab
|
…he#39681) ### Rationale for this change - To enable half float support on Java module. ### What changes are included in this PR? - [x] Add initial Float16 type support - [x] Unit test - [x] Integration test - [x] Documentation ### Are these changes tested? Yes. ### Are there any user-facing changes? No * Closes: apache#39680 Authored-by: david dali susanibar arce <davi.sarces@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
No