GH-3235: Row count limit for each row group #3236
Conversation
**Property:** `parquet.block.row.count.limit`
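A minimal sketch of how a writer might read such a property, using `java.util.Properties` as a stand-in for the Hadoop `Configuration` the real writer consults; the default value used here (no limit) is an assumption for illustration, not necessarily the PR's actual default:

```java
import java.util.Properties;

public class RowCountLimitDemo {
    static final String ROW_COUNT_LIMIT = "parquet.block.row.count.limit";
    // Hypothetical default: effectively "no limit" unless configured.
    static final long DEFAULT_LIMIT = Long.MAX_VALUE;

    // Reads the row-count limit from a key-value config, falling back to the default.
    static long rowCountLimit(Properties conf) {
        String v = conf.getProperty(ROW_COUNT_LIMIT);
        return v == null ? DEFAULT_LIMIT : Long.parseLong(v);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(ROW_COUNT_LIMIT, "100000000");
        System.out.println(rowCountLimit(conf)); // prints 100000000
    }
}
```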
Since there is an existing `parquet.block.size`.
```java
private final WriteSupport<T> writeSupport;
private final MessageType schema;
private final Map<String, String> extraMetaData;
private final long rowGroupSize;
```
cc @wgtmac and @nandorKollar |
```java
    >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive,
                                     // so let's not do it for every record.
  if (recordCount >= rowGroupRecordCountThreshold) {
    LOG.debug("record count reaches threshold: flushing {} records to disk.", recordCount);
```
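The flush condition in the snippet above can be sketched as a pure decision function: flush a row group when either the buffered size reaches the block size or the record count reaches the configured limit. The parameter names below are illustrative, not the actual field names in `InternalParquetRecordWriter`:

```java
public class FlushDecision {
    // Sketch of the post-PR decision: size threshold OR row-count threshold.
    static boolean shouldFlush(long bufferedSize, long blockSize,
                               long recordCount, long rowCountLimit) {
        return bufferedSize >= blockSize || recordCount >= rowCountLimit;
    }

    public static void main(String[] args) {
        // Only 10 MiB buffered against a 128 MiB block size, but the row
        // count limit is reached, so the row group is flushed anyway.
        System.out.println(shouldFlush(10L << 20, 128L << 20, 100_000_000L, 100_000_000L)); // true
        System.out.println(shouldFlush(10L << 20, 128L << 20, 42L, 100_000_000L));          // false
    }
}
```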
Maybe better to escalate the log level to INFO? It should not be noisy.
Maybe it would not be noisy to raise the level to INFO, but what purpose would it serve? If we want to answer why a Parquet file looks the way it does (the number/size of row groups, etc.), the logs from the file's creation are probably long gone.
For a typical Hadoop YARN cluster serving Spark workloads, application logs are preserved for a few days by collecting and aggregating them to HDFS. Anyway, it's not a big deal; we can always use parquet-cli to analyze suspicious Parquet files :)
Yeah, that's what I was thinking of: you write a Parquet file and then read it somewhere else months or years later. You won't have the related logs for sure.
It is a separate topic, but it would be a good idea to write these properties directly into the footer, so that if we have any issues with a file, at least we would know how it was created.
The Velox Parquet writer also has such a configuration.
Kindly ping @wgtmac @gszadovszky, could you please take a look? Thank you in advance!
**Rationale for this change**
Similar to ORC-1172; see background at #3235.
**What changes are included in this PR?**
A new configuration, `parquet.block.row.count.limit`, is added.
**Are these changes tested?**
UT is added.
Also verified by integrating with Spark, for example by setting the row count threshold to 100000000 for each row group in Spark.
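As an illustrative sketch of that Spark integration: Spark forwards settings with the `spark.hadoop.*` prefix into the Hadoop `Configuration` used by its Parquet writer, so the new property (name taken from this PR) could plausibly be set like this:

```shell
# Hypothetical invocation; assumes the parquet.block.row.count.limit
# property from this PR is picked up via Spark's spark.hadoop.* prefix.
spark-shell --conf spark.hadoop.parquet.block.row.count.limit=100000000
```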
**Are there any user-facing changes?**
Yes, a new conf is added.
Closes #3235