Conversation
The file is too large at 96 MB. Would you mind generating it just in your patch, or making it much smaller?
I agree with @mapleFU that 96M is too large to be a test file. What about adding a roundtrip test directly in the arrow repo?
@mapleFU @wgtmac you mean generate it in the test itself? If so, do you know how I can generate the equivalent of the below using the cpp API?
|
It may help, though a bit lengthy compared to the python code. |
It's also not a complex struct like a map, so most likely it wouldn't even throw in this case |
The file size looks good. But it does not seem necessary to add a test file, as all files in this repo are for interoperability across different parquet implementations.
mapleFU left a comment
I think this file is ok, but you should add a description for this file.
I added a line in the README
data/README.md
| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| chunked_string_map.parquet | Map(String, int32) containing strings that won't fit arrow Binary. Asserts arrow LargeBinary can read it. [Issue](https://github.com/apache/arrow/issues/32723) |
Some thoughts:
- The file name could be clearer, e.g. large_string_map.brotli.parquet.
- The GitHub issue link may become invalid in the future. What about adding a separate md file with the generation script, the file metadata (optional if the script is clear enough), and a more detailed explanation of this issue?
- arrow Binary -> arrow BinaryArray
- arrow LargeBinary -> arrow LargeBinaryArray
- Have you tried other codecs like gzip (and higher levels)? It may help further reduce the file size.
- Will rename the file
- If that's a must, I'll try to do it
- ok
- ok
- I took a quick look at this LinkedIn article, which says BROTLI has the highest compression rate among all the compression types, see https://www.linkedin.com/pulse/comparison-compression-methods-parquet-file-format-saurav-mohapatra/. In any case, I just tried GZIP and it yields a 2085290-byte file, as opposed to BROTLI, which produces a 4325-byte file.
Regarding #2, there are some other files in this list that point to JIRA issues. This GH issue has an equivalent JIRA issue; maybe I can point to the JIRA one instead?
@arthurpassos Thanks for doing this, and I agree that adding a test file here can be useful for other implementations as well. Why did you create a MAP node, though? Can we just have a regular string column?
@pitrou A regular string column does not suffer from this issue. In a nutshell, it's an issue that pops up when an arrow ChunkedArray with more than one chunk is produced for columns of complex types like maps. You can find more info at apache/arrow#32723
@arthurpassos Makes sense, thanks!
That said... we might go ahead and create several columns here:
The file will remain small anyway thanks to compression. But we can also keep the file as-is if you prefer.
We have discussed this in the PR and confirmed that a primitive string column does not have any issues: apache/arrow#35825 (comment). But yes, adding it here may benefit other implementations like Rust by letting them verify their capability.
Can we keep it like this? I would like to get this merged sooner rather than later; I feel this is a bit out of scope and can be addressed in the future.
@arthurpassos No problem, I'll merge then.
Thanks, but let's just wait until this discussion is resolved: apache/arrow#35825 (comment)
LGTM. Thanks @arthurpassos! Sorry that it is a little bit late in my timezone. I just approved but did not merge it in case there is any remaining issue. It would be great if @pitrou could take a final pass.
What's the result of the discussion? Does it mean that a
This version is ok, ready to be merged I believe |
I've verified:
@mapleFU Can this be merged then?
I have no permission to merge it, but it seems that it was merged :)
Generated with below python script:
Required by apache/arrow#35825