PARQUET-2470: Update website with larger ecosystem emphasis #59
wgtmac merged 5 commits into apache:production
+1!
etseidl
left a comment
+1 Hadoop not required 😄
content/en/docs/Overview/_index.md
Outdated
- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+ Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
+ It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
+ Parquet is available in multiple languages including Java, C++, and Python.
Echoing @amoeba, perhaps leave out specific languages and leave it vague.
I agree it is strange to have this mention of specific technologies. Maybe we can make all three locations consistent (and more general).
I think mentioning implementations (both as end-user software and as libs) is valuable but shouldn't be part of the elevator pitch. Other formats usually solve this with a dedicated sub-section or page, e.g.:
- https://jpeg.org/jpegxl/software.html (the list format is good, the fact that there's only a single implementation is not)
- https://paseto.io/
- https://autocrypt.org/dev-status.html
This would also allow multiple implementations for a single language, which sometimes can be valuable (e.g. if you have a backwards compatible, conservative variant and a fancy new one).
I agree 100%. I believe we are beginning to create just such a list on #53.
This set of examples is good. I have added it to https://issues.apache.org/jira/browse/PARQUET-2310, which tracks these examples.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
julienledem
left a comment
This looks great. Thank you for taking the initiative. Hadoop is not required indeed. Perhaps at some point we should rename parquet-mr to parquet-java?
alamb
left a comment
Per the feedback here https://github.com/apache/parquet-site/pull/59/files#r1599769911, I have updated the text in all three places to be:
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
It provides high performance data compression and encoding schemes to handle complex data in bulk.
From my perspective this PR is now ready to merge.
Thanks everyone for the reviews and comments.
- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+ Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
+ It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
Did we mean for this to say "high performance compression" or is it "high performance, compression"? I think it may be the latter. Or maybe "It provides performant compression and encoding schemes..." I was thinking the first version sounds too much like a description of the compression tool rather than the format.
I didn't mean for the comma, or lack thereof, to carry any additional semantic meaning. I am happy to put a comma there if you like.
No really strong feelings; I was just wondering if there was a subtextual focus intended.
Let me merge this. Thanks everyone!
Thanks @wgtmac
Thanks!

Rationale
As described on https://issues.apache.org/jira/browse/PARQUET-2470, Parquet's role in the analytics ecosystem is substantial.
However, https://parquet.apache.org/ currently emphasizes Parquet's role in the Hadoop ecosystem. I think this causes confusion in several ways:
Changes
Update the home page content to mirror the Apache Project Description at https://projects.apache.org/project.html?parquet (which does not mention Hadoop specifically).
Before this PR
After the PR