First draft of docs about parquet format vs mr#53
Conversation
| * Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. | ||
|
|
||
|
|
||
| ### Other Clients / Libraries / Tools |
There was a problem hiding this comment.
cc @tustvold @pitrou @emkornfield to see if we need to add well-known Parquet implementations in the arrow-rs and Apache Arrow C++ libraries at this moment.
There was a problem hiding this comment.
I think it would make sense to generally cover the parquet-cpp / arrow situation as this is also leading sometimes to a bit of confusion. The parquet part of the Arrow repository is actually part of the Parquet PMC, but back in the past we decided to merge it into Arrow as the Parquet C++ community was a subset of the Arrow C++ community and all development happened in the context of Arrow C++/Python.
There was a problem hiding this comment.
Added a list of the implementations that I'm aware of / that are referenced here below.
There was a problem hiding this comment.
I'm not sure it is crucial I added some more below. I think the discussion on the ML on how the parquet community views other implementations is something we should document when we come to a consensus.
content/en/docs/Overview/_index.md
Outdated
|
|
||
| The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
|
||
| Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). |
There was a problem hiding this comment.
Was this space intentional?
| Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). | |
| Parquet MR can be seen as the "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). |
Slight suggestion on the wording.
There was a problem hiding this comment.
Is parquet-mr actually the reference implementation, or would implementation in say parquet-cpp and parquet-rs be sufficient for standardisation of a feature? Are there already examples of this? Perhaps page indices which I believe were first implemented as part of Impala?
Perhaps it is beyond the scope of this issue, but I think the relationship between parquet-mr and parquet the specification is at the core of this issue, and the broader governance ambiguity surrounding parquet.
There was a problem hiding this comment.
For the record: parquet-mr and cudf were used as the two different PoC implementations in the vote for format change https://issues.apache.org/jira/browse/PARQUET-2261.
There was a problem hiding this comment.
For the record: page indices (aka column indexes) were implemented in Impala and parquet-mr at the same time (in the same room actually).
There was a problem hiding this comment.
I'll tweak the wording - haha you can probably see I was going back and forth about whether or not to call it "a" reference implementation or "the" reference implementation. also h/t to @gszadovszky who's wording I borrowed for this.
There was a problem hiding this comment.
To follow up -- we are discussing reference implementations on the mailing list: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm
content/en/docs/Overview/_index.md
Outdated
| Here is a non-exhaustive list of Parquet implementations: | ||
|
|
||
| * [parquet-mr](https://github.com/apache/parquet-mr) | ||
| * [parquet-cpp](https://github.com/apache/parquet-cpp) |
There was a problem hiding this comment.
| * [parquet-cpp](https://github.com/apache/parquet-cpp) | |
| * [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) |
There was a problem hiding this comment.
Thanks @xhochy! For my own knowledge, is parquet-cpp effectively deprecated in favor of https://github.com/apache/arrow/tree/main/cpp/src/parquet?
There was a problem hiding this comment.
Yes. We state that at the top of the README but this is sadly quite often overlooked.
There was a problem hiding this comment.
I recommend you add one more commit that moves all files to a deprecated directory and only have a readme that says it has moved. That way the history is still easily accessible but nobody gets confused.
There was a problem hiding this comment.
I like @jacques-n 's suggestion or it may be worth just archiving the repo on github.
content/en/docs/Overview/_index.md
Outdated
| * [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
| * [cudf](https://github.com/rapidsai/cudf) | ||
| * [impala](https://github.com/apache/impala) | ||
| * [duckdb](https://github.com/duckdb/duckdb) No newline at end of file |
There was a problem hiding this comment.
I'm not sure wha the intent of this list is there are a couple of others off the top of my mind:
- arrow2 rust
- fastparquet python
- Parquet Go (also housed in Arrow mono-repo not part of the Parquet PMC)
- polars (rust) I believe forked arrow2
There was a problem hiding this comment.
Perhaps it is worth drawing a distinction between parquet implementations that are made for direct external consumption, e.g. fastparquet, parquet-cpp, parquet-rs, parquet-go and those that are implementation details of a broader engine - e.g. polars, duckdb, impala?
content/en/docs/Overview/format.md
Outdated
There was a problem hiding this comment.
Would this empty file fill the contents later?
There was a problem hiding this comment.
I initially had the plan to split out parquet-format and the client libs into separate pages, but decided against that
content/en/docs/Overview/_index.md
Outdated
|
|
||
| ### Parquet Format | ||
|
|
||
| The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. |
There was a problem hiding this comment.
Let's please make spelling of the project name consistent: parquet-format not "Parquet Format".
content/en/docs/Overview/_index.md
Outdated
|
|
||
| The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
|
||
| Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below. |
There was a problem hiding this comment.
Again, can we write parquet-mr consistently?
There was a problem hiding this comment.
"of the Parquet format", not "of parquet-format". We are implementing the spec, not its repository :-)
content/en/docs/Overview/_index.md
Outdated
| Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below. | ||
|
|
||
| Included in parquet-mr: | ||
| * Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. |
There was a problem hiding this comment.
Is there actually some Scala code in it?
There was a problem hiding this comment.
There was a problem hiding this comment.
Perhaps we should just say Java implementation here. The scala code is just for filters and we don't have a full scala implementation.
content/en/docs/Overview/_index.md
Outdated
| Here is a non-exhaustive list of Parquet implementations: | ||
|
|
||
| * [parquet-mr](https://github.com/apache/parquet-mr) | ||
| * [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) |
There was a problem hiding this comment.
| * [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) | |
| * [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) |
content/en/docs/Overview/_index.md
Outdated
| * [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet) | ||
| * [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
| * [cudf](https://github.com/rapidsai/cudf) | ||
| * [impala](https://github.com/apache/impala) |
|
@pitrou - Updated with per your feedback |
alamb
left a comment
There was a problem hiding this comment.
Thank you @vinooganesh -- I think this is a great start and we should merge this PR and iterate on any remaining content.
content/en/docs/Overview/_index.md
Outdated
|
|
||
| The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
|
||
| The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
I don't think there is any reason to add quotes
| The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. | |
| The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
I would second the removal of the quotes here.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
@wgtmac - given consensus here, would you be able to merge? |
|
Thanks - and just to make sure it's clear, my main goal was to start the process of actually documenting the institutional knowledge in the community and this PR is mostly intended as a starting point. There are some other much meatier topics (parquet v2's definition for example) that will need to be discussed in follow up PRs. |
I think documenting the current / institutional knowledge is superful helpful. Thank you for pushing this forward |
pitrou
left a comment
There was a problem hiding this comment.
+1 in general, but need to polish terminology and spelling.
content/en/docs/Overview/_index.md
Outdated
|
|
||
| ### parquet-mr | ||
|
|
||
| The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. |
There was a problem hiding this comment.
Can we please make the terminology consistent? Either describe both parquet-format and parquet-mr as "projects" or as "GitHub repositories", but not one and the other.
There was a problem hiding this comment.
I can make this change. These are referred to publicly as both projects and repo (in our mailing list as well) so I deliberately put both in. I'll stick with repository though.
content/en/docs/Overview/_index.md
Outdated
|
|
||
| The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. | ||
|
|
||
| The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
It's "a" reference implementation, it means there are other ones. But I don't see them listed here.
Either list all reference implementations explicitly, or make this "the" reference implementation.
There was a problem hiding this comment.
If we knew what those references implementations are, I agree it would be valuable to document.
However, I think there is consensus required before we made such a determination
Thus, for this PR I suggest:
- Remove the word "reference"
- File a follow on ticket / discussion in the mailing list to figure out what should be listed as references implementations
| The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. | |
| The parquet-mr repo contains an implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. |
There was a problem hiding this comment.
There have been a lot of conversations about this: #53 (comment) (and others on the thread) and I'm inclined to keep this as is. I don't think we need to exhaustively list other the reference reference implementing when there is a list of implementations below. @gszadovszky has also called this a reference implementation and I think it helps clarify the relationship between the parquet-format and parquet-mr. I'm more than happy to update this once the community has reached a consensus after the mailing list discussion that @alamb suggested though.
There was a problem hiding this comment.
If we don't agree on what a reference implementation then we should not list parquet-mr as a reference implementation. The term "reference implementation" has an official connotation and implies a specific status; it certainly should not be assigned lightly.
There was a problem hiding this comment.
cc @gszadovszky @Fokko @xhochy @wgtmac. I don't want to further block this PR by settling this beforehand, so I'm going to remove the word "reference" and we can add it back if we want to in a subsequent PR.
There was a problem hiding this comment.
I started a thread on the mailing list about this topic to see if we can reach consensus: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm
content/en/docs/Overview/_index.md
Outdated
|
|
||
| ### parquet-mr | ||
|
|
||
| The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. |
There was a problem hiding this comment.
Also, can we explain what "mr" stands for? It's a mystery for most people.
There was a problem hiding this comment.
@pitrou I assume it's mapreduce, but please correct me if I'm wrong
content/en/docs/Overview/_index.md
Outdated
|
|
||
| ### parquet-format | ||
|
|
||
| The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. |
content/en/docs/Overview/_index.md
Outdated
|
|
||
| * [parquet-mr](https://github.com/apache/parquet-mr) | ||
| * [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) | ||
| * [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) |
There was a problem hiding this comment.
Can we make sure all these project names/references are appropriately spelled and capitalized?
There was a problem hiding this comment.
And could also mention for Go (and similarly for rust below) that it is a subproject of Arrow Go, similarly like done for C++ above
(it also seems there are several Parquet Go implementations, others that are not part of the Arrow project, so it's good to clarify this one is Arrow related. But at the same time it's not entirely clear what the criterium is for being listed here ..)
There was a problem hiding this comment.
I recommend that (as a follow on PR) we turn this list into a table, something like
| Project | Language | Website | API Docs |
|---|---|---|---|
| parquet-mr | Java | link | |
| parquet-cpp | C++ | link | link |
| parquet-rs | Rust | link | link |
Also I recommend the criteria for being listed here is "Open source implementations of the parquet format" (which is a low bar to be sure
I would be happy to propose such changes as a follow on PR
There was a problem hiding this comment.
FYI this work is tracked by https://issues.apache.org/jira/browse/PARQUET-2310 and there is a draft at #34. Once this PR gets merged we'll start working on that
content/en/docs/Overview/_index.md
Outdated
| * [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) | ||
| * [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) | ||
| * [cudf](https://github.com/rapidsai/cudf) | ||
| * [apache impala](https://github.com/apache/impala) |
content/en/docs/Overview/_index.md
Outdated
| * [cudf](https://github.com/rapidsai/cudf) | ||
| * [apache impala](https://github.com/apache/impala) | ||
| * [duckdb](https://github.com/duckdb/duckdb) | ||
| * [fast-parquet python](https://github.com/dask/fastparquet) |
There was a problem hiding this comment.
It's "fastparquet". Perhaps "fastparquet, a Python implementation of the Parquet format".
|
Given the size and substance of this PR, it is unlikely we will get it perfect the first time. Also, I don't see any disagreement across commenters on the value of this information. I would personally suggest that we address all the outstanding comments as best as possible, merge this PR, and then iterate on the content in subsequent PRs |
etseidl
left a comment
There was a problem hiding this comment.
One last spelling nit 😉. Thank you @vinooganesh for starting this documentation project.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
|
I just merged it. Thanks everyone! |
|
Thanks everyone! |



@wgtmac @Fokko @gszadovszky @shangxinli @jacques-n
This is a hopefully non-controversial v0 explanation of the difference between the parquet-format and parquet-mr repos. Always happy to provide more depth and open to feedback here.
It doesn't include any of the "v2" stuff that we were discussing on the email chain, but that can be a quick follow up PR if people think including that would be helpful.