Working with Parquet files is harder than working with CSV files, mostly because Parquet files are hard to browse.
The idea is to build a simple CLI app that could be released as a Docker image.
We can start simple, with 2 commands:
- metadata
- data
bin/parquet read:metadata
bin/parquet read:data
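To make the two-command layout concrete, here is a minimal sketch of how `read:metadata` and `read:data` could be dispatched. It uses plain PHP only so that it stays self-contained; the `dispatch()` function and the placeholder handlers are hypothetical names for illustration, and a real implementation would more likely register the commands with a console framework.

```php
<?php

declare(strict_types=1);

/**
 * Look up a command by name and run its handler.
 *
 * @param list<string> $argv raw CLI arguments, including the script name
 * @param array<string, callable(list<string>): int> $commands name => handler
 *
 * @return int exit code (0 on success, 1 for an unknown or missing command)
 */
function dispatch(array $argv, array $commands): int
{
    $name = $argv[1] ?? null;

    if ($name === null || !isset($commands[$name])) {
        fwrite(STDERR, 'Available commands: ' . implode(', ', array_keys($commands)) . PHP_EOL);

        return 1;
    }

    // Everything after the command name is passed to the handler untouched.
    return $commands[$name](array_slice($argv, 2));
}

// Placeholder handlers: they only echo what a real command would do.
$commands = [
    'read:metadata' => static function (array $args): int {
        echo 'would print metadata for ' . ($args[0] ?? '<file>') . PHP_EOL;

        return 0;
    },
    'read:data' => static function (array $args): int {
        echo 'would print rows from ' . ($args[0] ?? '<file>') . PHP_EOL;

        return 0;
    },
];
```

In an entry script this would be wired up with something like `exit(dispatch($argv, $commands));`.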
For metadata, I have already created a POC. Here is the output:
$ ./parquet read:metadata --help
Usage:
  ./parquet_viewer.php [options] [--] <file>
Arguments:
  file                  path to parquet file
Options:
      --columns         Display column details
      --row-groups      Display row group details
      --column-chunks   Display column chunks details
      --page-headers    Display page headers details
  -h, --help            Display help for the given command. When no command is given display help for the ./parquet_viewer.php command
  -q, --quiet           Do not output any message
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
Usage example:
$ ./parquet.php read:metadata path/to/parquet.file --columns --row-groups --column-chunks --page-headers
----------------- -------------------- Metadata ---------------------------------------
 file_path         path/to/parquet.file
 parquet_version   1
 created_by        flow-php parquet version 1.x
 rows              1
----------------- ---------------------------------------------------------------------
-------------------------------------------- ------- Flat Columns ------ ------------ ---------------- ----------------
 path                                         type         logical_type   repetition   max_repetition   max_definition
-------------------------------------------- ------------ -------------- ------------ ---------------- ----------------
 id                                           INT32        -              OPTIONAL     0                1
 aMapField.key_value.someKey                  BYTE_ARRAY   -              REQUIRED     1                1
 aMapField.key_value.aStructField.anInteger   INT32        -              OPTIONAL     1                2
 aMapField.key_value.aStructField.aString     BYTE_ARRAY   -              OPTIONAL     1                2
 rootLevelStructField.anotherInteger          INT32        -              OPTIONAL     0                1
 rootLevelStructField.anotherString           BYTE_ARRAY   -              OPTIONAL     0                1
 aListField.list.someInteger                  INT32        -              OPTIONAL     1                2
-------------------------------------------- ------------ -------------- ------------ ---------------- ----------------
---------- ----- Row Groups ---------------
 num_rows   total_byte_size   columns_count
---------- ----------------- ---------------
 1          285               7
---------- ----------------- ---------------
-------------------------------------------- ------------------------ Column Chunks ------------- ------------ ------------------------ ------------------
 path                                         encodings                compression   file_offset   num_values   dictionary_page_offset   data_page_offset
-------------------------------------------- ------------------------ ------------- ------------- ------------ ------------------------ ------------------
 id                                           [RLE,BIT_PACKED,PLAIN]   GZIP          4             1            -                        4
 aMapField.key_value.someKey                  [RLE,BIT_PACKED,PLAIN]   GZIP          56            2            -                        56
 aMapField.key_value.aStructField.anInteger   [RLE,BIT_PACKED,PLAIN]   GZIP          119           2            -                        119
 aMapField.key_value.aStructField.aString     [RLE,BIT_PACKED,PLAIN]   GZIP          179           2            -                        179
 rootLevelStructField.anotherInteger          [RLE,BIT_PACKED,PLAIN]   GZIP          243           1            -                        243
 rootLevelStructField.anotherString           [RLE,BIT_PACKED,PLAIN]   GZIP          295           1            -                        295
 aListField.list.someInteger                  [RLE,BIT_PACKED,PLAIN]   GZIP          358           3            -                        358
-------------------------------------------- ------------------------ ------------- ------------- ------------ ------------------------ ------------------
-------------------------------------------- -------------- -------- Page Headers ------ ------------------- ----------------------- -----------------
 path                                         type           encoding   compressed_size   uncompressed_size   dictionary_num_values   data_num_values
-------------------------------------------- -------------- ---------- ----------------- ------------------- ----------------------- -----------------
 id                                           DATA_PAGE_V2   PLAIN      26                6                   -                       1
 aMapField.key_value.someKey                  DATA_PAGE_V2   PLAIN      37                22                  -                       2
 aMapField.key_value.aStructField.anInteger   DATA_PAGE_V2   PLAIN      34                14                  -                       2
 aMapField.key_value.aStructField.aString     DATA_PAGE_V2   PLAIN      38                20                  -                       2
 rootLevelStructField.anotherInteger          DATA_PAGE_V2   PLAIN      26                6                   -                       1
 rootLevelStructField.anotherString           DATA_PAGE_V2   PLAIN      37                17                  -                       1
 aListField.list.someInteger                  DATA_PAGE_V2   PLAIN      35                18                  -                       3
-------------------------------------------- -------------- ---------- ----------------- ------------------- ----------------------- -----------------
read:data is a bit more tricky, but I think we can leverage ETL here: just read the file and display the first X rows.
Ideally, users could also scroll through the data with the PgDn/PgUp/Up/Down keys, but that can come later. We can start simple.
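As a rough illustration of "display X rows", the sketch below limits any iterable of rows with a generator, so the rest of the file never has to be read. It assumes rows arrive as associative arrays; `takeRows()`, `renderRows()` and `sampleRows()` are hypothetical names, and `sampleRows()` is only a stand-in for whatever the ETL reader would actually yield.

```php
<?php

declare(strict_types=1);

/**
 * Yield at most $limit rows, skipping the first $offset rows.
 *
 * @param iterable<array<string, mixed>> $rows
 *
 * @return \Generator<int, array<string, mixed>>
 */
function takeRows(iterable $rows, int $limit, int $offset = 0): \Generator
{
    if ($limit <= 0) {
        return;
    }

    $index = 0;
    $yielded = 0;

    foreach ($rows as $row) {
        if ($index++ < $offset) {
            continue; // still before the requested window
        }

        yield $row;

        if (++$yielded >= $limit) {
            return; // stop early, the remaining rows are never pulled
        }
    }
}

/**
 * Render rows as tab-separated lines, using the first row's keys as a header.
 *
 * @param iterable<array<string, mixed>> $rows
 */
function renderRows(iterable $rows): string
{
    $lines = [];

    foreach ($rows as $row) {
        if ($lines === []) {
            $lines[] = implode("\t", array_keys($row));
        }

        // Nested values (lists, structs, maps) and nulls are shown as JSON.
        $lines[] = implode("\t", array_map(
            static fn ($value): string => is_scalar($value) ? (string) $value : (string) json_encode($value),
            $row
        ));
    }

    return implode(PHP_EOL, $lines);
}

/** Hypothetical stand-in for rows read from a Parquet file. */
function sampleRows(): \Generator
{
    for ($i = 1; $i <= 100; $i++) {
        yield ['id' => $i, 'name' => "row-{$i}"];
    }
}

// Show the first 3 rows only.
echo renderRows(takeRows(sampleRows(), 3)), PHP_EOL;
```

The same function could later back the scrolling idea: page N with page size S is `takeRows($rows, S, (N - 1) * S)`, although re-reading from the start for every page would likely be too slow for large files.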