Conversation
The file is too large at 96 MB. Would you mind generating it just in your patch, or making it much smaller?
I agree with @mapleFU that 96M is too large to be a test file. What about adding a roundtrip test directly in the arrow repo?
@mapleFU @wgtmac you mean generate it in the test itself? If so, do you know how I can generate the equivalent of the below using the cpp API?
|
It may help, though a bit lengthy compared to the python code. |
It's also not a complex struct like a map, so most likely it wouldn't even throw in this case |
The file size looks good. But it does not seem necessary to add a test file, as all files in this repo are for interoperability across different parquet implementations.
mapleFU left a comment
I think this file is ok, but you should add a description for this file.
I added a line in the README
data/README.md
| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| chunked_string_map.parquet | Map(String, int32) containing strings that won't fit arrow Binary. Asserts arrow LargeBinary can read it. [Issue](https://github.com/apache/arrow/issues/32723) |
Some thoughts:
- The file name could be clearer, e.g. large_string_map.brotli.parquet.
- The GitHub issue link may become invalid in the future. What about adding a separate md file with the generation script, the file metadata (optional if the script is clear enough), and a more detailed explanation of this issue?
- arrow Binary -> arrow BinaryArray
- arrow LargeBinary -> arrow LargeBinaryArray
- Have you tried other codecs like gzip (and higher levels)? It may help further reduce the file size.
- Will rename the file
- If that's a must, I'll try to do it
- ok
- ok
- I took a quick look at this LinkedIn article, which says BROTLI has the highest compression rate among all the compression types, see https://www.linkedin.com/pulse/comparison-compression-methods-parquet-file-format-saurav-mohapatra/. In any case, I just tried GZIP and it yields a 2085290-byte file, as opposed to BROTLI, which produces a 4325-byte file.
Regarding #2, there are some other files in this list that point to JIRA issues. This GH issue has an equivalent JIRA issue; maybe I can point to the JIRA one instead?
@arthurpassos Thanks for doing this, and I agree that adding a test file here can be useful for other implementations as well. Why did you create a MAP node, though? Can we just have a regular string column?
@pitrou A regular string column does not suffer from this issue. In a nutshell, it's an issue that pops up when an arrow ChunkedArray with more than one chunk is produced for columns of complex types like maps. You can find more info at apache/arrow#32723
@arthurpassos Makes sense, thanks!
That said... we might go ahead and create several columns here:
The file will remain small anyway thanks to compression. But we can also keep the file as-is if you prefer.
We have discussed this in the PR and confirmed that a primitive string column does not have any issues: apache/arrow#35825 (comment). But yes, adding it here may benefit other implementations like Rust by letting them verify their capability.
Can we keep it like this? I would like to get this merged sooner rather than later; I feel this is a bit out of scope and can be addressed in the future.
@arthurpassos No problem, I'll merge then.
Thanks, but let's just wait until this discussion is resolved: apache/arrow#35825 (comment)
LGTM. Thanks @arthurpassos! Sorry that it is a little bit late in my timezone. I just approved but did not merge it in case there is any remaining issue. It would be great if @pitrou could take a final pass.
What's the result of the discussion? Does it mean that a
This version is ok, ready to be merged I believe |
I've verified:
@mapleFU Can this be merged then?
I have no permission to merge it, but it seems that it was merged :)
Generated with below python script:
Required by apache/arrow#35825